Austin Schuh | 8d0a285 | 2019-12-28 22:54:28 -0800 | [diff] [blame] | 1 | % TEMPLATE for Usenix papers, specifically to meet requirements of |
| 2 | % TCL97 committee. |
| 3 | % originally a template for producing IEEE-format articles using LaTeX. |
| 4 | % written by Matthew Ward, CS Department, Worcester Polytechnic Institute. |
| 5 | % adapted by David Beazley for his excellent SWIG paper in Proceedings, |
| 6 | % Tcl 96 |
| 7 | % turned into a smartass generic template by De Clarke, with thanks to |
| 8 | % both the above pioneers |
| 9 | % use at your own risk. Complaints to /dev/null. |
| 10 | % make it two column with no page numbering, default is 10 point |
| 11 | |
| 12 | % Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate |
| 13 | % the .sty file from the LaTeX source template, so that people can |
| 14 | % more easily include the .sty file into an existing document. Also |
| 15 | % changed to more closely follow the style guidelines as represented |
| 16 | % by the Word sample file. |
| 17 | % This version uses the latex2e styles, not the very ancient 2.09 stuff. |
| 18 | |
| 19 | % adapted for Ottawa Linux Symposium |
| 20 | |
| 21 | \documentclass[twocolumn]{article} |
| 22 | \usepackage{ols,epsfig} |
| 23 | \begin{document} |
| 24 | |
| 25 | %Remove this next line if your system defaults correctly. |
| 26 | \special{papersize=8.5in,11in} |
| 27 | |
| 28 | %don't want date printed |
| 29 | \date{} |
| 30 | |
| 31 | %make title bold and 14 pt font (Latex default is non-bold, 16 pt) |
| 32 | \title{\Large \bf Linux Kernel SCTP : The Third Transport} |
| 33 | |
| 34 | %for single author (just remove % characters) |
| 35 | \author{ |
| 36 | La~Monte H.P.\ Yarroll \\ |
| 37 | %{\em Your Department} \\ |
| 38 | {\em Motorola GTSS}\\ |
| 39 | %{\em Your City, State, ZIP}\\ |
| 40 | % is there a standard format for email/URLs?? |
| 41 | % remember that ~ doesn't do what you expect, use \~{}. |
| 42 | {\normalsize piggy@acm.org} \\ |
| 43 | % |
| 44 | % copy the following lines to add more authors |
| 45 | \and |
| 46 | Karl Knutson \\ |
| 47 | {\em Motorola GTSS}\\ |
| 48 | %% is there a standard format for email/URLs?? |
| 49 | {\normalsize karl@athena.chicago.il.us} |
| 50 | % |
| 51 | } % end author |
| 52 | |
| 53 | \maketitle |
| 54 | |
| 55 | % You have to do this to suppress page numbers. Don't ask. |
| 56 | \thispagestyle{empty} |
| 57 | \renewcommand{\thefootnote}{\fnsymbol{footnote}} |
| 58 | |
| 59 | \subsection*{Abstract} |
| 60 | % nuked italics -- FD |
| 61 | |
| 62 | The Stream Control Transmission Protocol (SCTP) is a reliable |
| 63 | message-oriented protocol with transparent support for multihoming. |
| 64 | It allows multiple independent complex exchanges which all share a |
| 65 | single connection and congestion context. |
| 66 | |
| 67 | We provide an overview of the protocol, the UDP-style API and the |
| 68 | details of the Linux kernel reference implementation. The brief API |
| 69 | discussion is intended for developers wishing to use SCTP. The |
| 70 | detailed implementation discussion is for developers interested in |
| 71 | contributing to the kernel development effort. |
| 72 | |
| 73 | \section{Introduction} |
| 74 | |
| 75 | The developers at the Linux 2.5 Kernel Summit in San Jose achieved a |
| 76 | rough consensus that 2.5 should probably support SCTP, a new transport |
| 77 | protocol from the IETF. This paper introduces the ongoing work on |
| 78 | such an implementation, providing some details for both the |
| 79 | application developer and the kernel developer. |
| 80 | |
| 81 | The Stream Control Transmission Protocol (SCTP) is a reliable |
| 82 | message-oriented protocol with transparent support for multihoming. |
| 83 | It allows multiple independent complex exchanges which all share a |
| 84 | single connection and congestion context. |
| 85 | |
| 86 | \subsection{History of SCTP} |
| 87 | The SIGTRAN (Signalling Transport) Working Group of the IETF is |
| 88 | concerned with the transport of telephony signalling data over IP. |
| 89 | Upon reviewing the available standard transport protocols, they |
| 90 | concluded that none of them met the transport requirements of |
| 91 | signalling data. |
| 92 | |
| 93 | SIGTRAN concluded that they needed a new transport protocol which |
| 94 | could provide reliable message delivery, tolerate network failures, |
| 95 | and avoid the head-of-line-blocking problem. We will discuss this |
| 96 | problem later. |
| 97 | |
| 98 | The WG selected a proposal from Randall Stewart and Qiaobing Xie of |
| 99 | Motorola as a starting point. Stewart and Xie had developed a |
| 100 | Distributed Processing Environment, Quantix, aimed at telephony |
| 101 | applications. This DPE had been successfully demonstrated at Geneva |
| 102 | Telecom in 1999. |
| 103 | |
| 104 | The Working Group took great care in constructing the new protocol, |
| 105 | SCTP, incorporating many lessons learned from TCP, such as congestion |
| 106 | control, selective ACK, message fragmentation and bundling. |
| 107 | |
| 108 | The core transport protocol from Quantix brought support for |
| 109 | multihoming, message framing, and streams. We discuss all of these |
| 110 | features at length later. |
| 111 | |
| 112 | The IESG decided that the resulting protocol was robust enough to be |
| 113 | elevated from a specialised transport for telephony signalling to a new |
| 114 | general purpose transport to stand beside UDP and TCP. To this end, |
| 115 | they moved the work from SIGTRAN to TSVWG, the general transport |
| 116 | group. |
| 117 | |
| 118 | As of this writing, the core specification, \cite{rfc2960}, is at Proposed |
| 119 | Standard. There have been three successful bakeoffs covering over 25 |
| 120 | separate implementations. Lessons learned from the most recent |
| 121 | bakeoff are being written up in an ``Implementor's Guide'', \cite{impl}. |
| 122 | |
| 123 | \subsection{SCTP in the Linux kernel} |
| 124 | |
| 125 | Shortly before the first bakeoff, the IESG asked SIGTRAN to move SCTP |
| 126 | from riding on UDP to riding directly on top of IP. The long term |
| 127 | goal was clearly was to move SCTP from user space into the kernel. |
| 128 | |
| 129 | Aside from the obvious performance gains, this has the effect of |
| 130 | reducing the number of implementations to roughly one per operating |
| 131 | system. This makes it easier to verify the stability of most of the |
| 132 | implementations which appear on the Internet. |
| 133 | |
| 134 | Randall Stewart saw the importance of this and started one of the |
| 135 | authors of this paper working on a port of the user space |
| 136 | implementation to the Linux kernel. This port was intended as a |
| 137 | reference for developers of implementations for other kernels to |
| 138 | examine. The Linux kernel implementation has since diverged |
| 139 | significantly from the user space reference, but maintains the |
| 140 | standards of a reference implementation (see Coding Standards, below). |
| 141 | |
| 142 | \subsection{SCTP examples} |
| 143 | |
| 144 | SCTP is a reliable message-oriented protocol with transparent support |
| 145 | for multihoming. It allows multiple independent complex exchanges |
| 146 | which all share a single connection and congestion context. |
| 147 | |
| 148 | Many network applications operate by exchanging simultaneously, short, |
| 149 | similar sequences of data continuously. The traffic produced by these |
| 150 | operations can be characterised as MICE (Multiple Independent Complex |
| 151 | Exchanges). It is also true that many applications which use MICE |
| 152 | also have high network reliability requirements. |
| 153 | |
| 154 | \subsubsection{A database app} |
| 155 | One example is a client/server database application. Each request and |
| 156 | each response is a message. Each transaction is a sequence of |
| 157 | dependent request/response pairs. |
| 158 | |
| 159 | Implemented over TCP, this application would have to provide its own |
| 160 | message boundaries, since TCP sends bytes, not messages. How do we |
| 161 | implement MICE with TCP? We have two ways of doing this: multiple |
| 162 | connections, or a single multiplexed and reused connection. |
| 163 | |
| 164 | With each transaction over a separate TCP connection, we gain the |
| 165 | independence of transactions, but at a cost in performance. Since TCP |
| 166 | (as a general purpose transport protocol) uses congestion control, |
| 167 | each of the connections would have to go through slow-start and if |
| 168 | most transactions were short, they would never get out of slow-start. |
| 169 | |
| 170 | With all transactions over a single TCP connection, we make efficient |
| 171 | use of the network bandwidth, but open ourselves up to the |
| 172 | head-of-line blocking problem. This means that if one segment in one |
| 173 | transaction is lost, this blocks all transactions, not just the one |
| 174 | with the lost segment. |
| 175 | |
| 176 | If we use SCTP for the same application we gain the benefits of using |
| 177 | TCP, as well as advantages peculiar to SCTP. SCTP directly supports |
| 178 | messages and guarantees TCP-like levels of bandwidth efficiency via |
| 179 | bundling and fragmentation. Each database transaction can be |
| 180 | represented as an ordered stream of messages, which are independent in |
| 181 | SCTP for retransmission purposes. This means that while SCTP has the |
| 182 | same congestion control mechanisms as TCP, it does not have to resort |
| 183 | to multiple connections nor is it vulnerable to the head-of-line |
| 184 | blocking problem. |
| 185 | |
| 186 | \subsubsection{A free clinic} |
| 187 | |
| 188 | Another example of SCTP use is for a free\footnote{Free as in ``free |
| 189 | beer''.} clinic which needs a reliable way to use its IP-networked |
| 190 | patient monitoring software. |
| 191 | |
| 192 | This has many similarities to the example above in that different |
| 193 | monitoring devices would need to send simultaneous |
| 194 | information---multiple independent complex exchanges. The main |
| 195 | difference is in the higher network reliability requirements. |
| 196 | |
| 197 | A reasonable way to improve the network reliability is to set up a |
| 198 | parallel network and use multihoming for the client and server |
| 199 | applications. However, if the application is TCP-based, the |
| 200 | multihoming needs to be added to the application. With SCTP, the |
| 201 | multihoming ability is built into the protocol. All that is necessary |
| 202 | is to make the appropriate socket calls and SCTP will take advantage |
| 203 | of the addresses available in the existing network. This also applies |
| 204 | if one side of the connection has more addresses than the other. |
| 205 | |
| 206 | \section{The UDP-style API} |
| 207 | |
| 208 | Any new protocol needs an API. In particular for an Internet |
| 209 | protocol, it's important to have the API match the API normally used |
| 210 | for IP networks. This is the Berkeley sockets model---the SCTP |
| 211 | version is defined in the Internet Draft ``Sockets API Extensions for |
| 212 | SCTP''\cite{api}. The API draft defines two complementary interfaces |
| 213 | to SCTP--one for compatibility with older TCP-based applications, and |
| 214 | another for new applications designed expressly to use SCTP. The |
| 215 | Linux Kernel SCTP stack does not yet implement the former, so we |
| 216 | discuss only the UDP-style interface. |
| 217 | |
| 218 | The conceptual model of the UDP-style API is (naturally) that of plain |
| 219 | UDP. To send a message in UDP, you create a socket, bind an address |
| 220 | to it and send your message using \texttt{sendmsg()}. To receive a |
| 221 | message in UDP, you create a socket, bind an address to it and use |
| 222 | \texttt{recvmsg()}. It's much the same with the UDP-style API for |
| 223 | SCTP. To send a message, you create a socket, bind \textit{addresses} |
| 224 | to it and use \texttt{sendmsg()}. The SCTP stack underlying the API |
| 225 | handles association startup and shutdown automatically. The same goes |
| 226 | for message reception. To receive a message in UDP-style, you create |
| 227 | a socket, bind \textit{addresses} to it and use \texttt{recvmsg()}. |
| 228 | |
| 229 | The important API differences between UDP and UDP-style SCTP are: |
| 230 | multihoming; ancillary data; and the option of notifications from the |
| 231 | SCTP stack. |
| 232 | |
| 233 | \subsection{Multihoming and \texttt{bindx()}} |
| 234 | |
| 235 | There are three ways to work with multihoming with SCTP. One is to |
| 236 | ignore multihoming and use one address. Another way is to bind all |
| 237 | your addresses through the use of \texttt{INADDR\_ANY} or |
| 238 | \texttt{IN6ADDR\_ANY}. This will ``associate the endpoint with the |
| 239 | optimal subset of available local interfaces.''(Section 3.1.2, |
| 240 | \cite{api}) The most flexible way is through the use of |
| 241 | \texttt{sctp\_bindx()}, which allows additional addresses to be |
| 242 | added to a socket after the first one is bound with \texttt{bind()}, |
| 243 | but before the socket is used to transfer or receive data. The |
| 244 | function \texttt{sctp\_bindx()} is further described in section 8.1 of |
| 245 | \cite{api}. |
| 246 | |
| 247 | \subsection{Ancillary data} |
| 248 | |
| 249 | To use streams with the UDP-style API, you use ancillary data in the |
| 250 | \texttt{struct~cmsghdr} part of the \texttt{struct~msghdr} argument to |
| 251 | both \texttt{sendmsg()} and \texttt{recvmsg()}. Ancillary data is |
| 252 | used for initialisation data (\texttt{struct~sctp\_initmsg} and for |
| 253 | header data (\texttt{struct~sctp\_sndrcvinfo}). |
| 254 | |
| 255 | Ancillary data are manipulated with the macros \texttt{CMSG\_FIRSTHDR, |
| 256 | CMSG\_NEXTHDR, CMSG\_DATA, CMSG\_SPACE, \textnormal{and} CMSG\_LEN}. |
| 257 | These are all defined in \cite{rfc2292}. \cite{api} provides a nice |
| 258 | example in section 5.4.2. |
| 259 | |
| 260 | {\tt \small |
| 261 | \begin{verbatim} |
| 262 | struct sctp_initmsg { |
| 263 | uint16_t sinit_num_ostreams; |
| 264 | uint16_t sinit_max_instreams; |
| 265 | uint16_t sinit_max_attempts; |
| 266 | uint16_t sinit_max_init_timeo; |
| 267 | }; |
| 268 | \end{verbatim} |
| 269 | } |
| 270 | |
| 271 | The initialisation ancillary data sets information for starting |
| 272 | new associations. |
| 273 | |
| 274 | {\tt \small |
| 275 | \begin{verbatim} |
| 276 | struct sctp_sndrcvinfo { |
| 277 | uint16_t sinfo_stream; |
| 278 | uint16_t sinfo_ssn; |
| 279 | uint16_t sinfo_flags; |
| 280 | uint32_t sinfo_ppid; |
| 281 | uint32_t sinfo_context; |
| 282 | uint8_t sinfo_dscp; |
| 283 | sctp_assoc_t sinfo_assoc_id; |
| 284 | }; |
| 285 | \end{verbatim} |
| 286 | } |
| 287 | |
| 288 | The header ancillary data reports information gleaned from the SCTP |
| 289 | headers. If requested with the \texttt{SCTP\_RECVDATAIOEVNT} socket |
| 290 | option, this ancillary data is provided with every inbound data |
| 291 | message. There is a handy key (\texttt{sinfo\_assoc\_id}) which |
| 292 | identifies the association for this particular message. It also |
| 293 | provides the flags needed to implement partial delivery of very large |
| 294 | messages. |
| 295 | |
| 296 | Outbound messages should include an \texttt{sctp\_sndrcvinfo} ancillary |
| 297 | data structure to tell SCTP which SCTP stream to put this datagram |
| 298 | into. It is also possible to set a default stream so that this |
| 299 | ancillary data may be omitted. |
| 300 | |
| 301 | \subsection{Notifications} |
| 302 | |
| 303 | SCTP provides for the concept of optional notifications. These are |
| 304 | messages delivered in-band about events inside the SCTP stack, such as |
| 305 | a destination transport address failure or a new association coming |
| 306 | up. The notifications are marked with the \texttt{MSG\_NOTIFICATION} |
| 307 | flag in the \texttt{msg\_flags} field of the \texttt{sctp\_sendrcvinfo} |
| 308 | ancillary data. The notification is delivered as the body of the |
| 309 | message returned by \texttt{recvmsg()}. |
| 310 | |
| 311 | In \ref{notifications} we find a table of notifications. Each |
| 312 | notification delivers its own data structure which shares the same |
| 313 | name (lower case, naturally) as the notification type itself. The first |
| 314 | field of every notification is a \texttt{uint16\_t} which caries the |
| 315 | notification type. |
| 316 | |
| 317 | \begin{figure*}[t] |
| 318 | \begin{center} |
| 319 | {\tt |
| 320 | \begin{tabular}{ l l l } |
| 321 | \hline |
| 322 | \textnormal{Type} & \textnormal{Socket Option} & \textnormal{Description} \\ |
| 323 | \hline |
| 324 | SCTP\_ASSOC\_CHANGE & SCTP\_RECVASSOCEVNT & \textnormal{Change of association} \\ |
| 325 | SCTP\_PEER\_ADDR\_CHANGE & SCTP\_RECVADDREVNT & \textnormal{Change in status of a given address} \\ |
| 326 | SCTP\_REMOTE\_ERROR & SCTP\_RECVPEERERR & \textnormal{An error received from a peer} \\ |
| 327 | SCTP\_SEND\_FAILED & SCTP\_RECVSENDFAILEVNT & \textnormal{A failure to send} \\ |
| 328 | SCTP\_SHUTDOWN\_EVENT & SCTP\_RECVDOWNEVNT & \textnormal{The reception of a \texttt{SHUTDOWN} chunk} \\ |
| 329 | \end{tabular} |
| 330 | } |
| 331 | \end{center} |
| 332 | \caption{\label{notification}Useful notifications for an SCTP socket} |
| 333 | \end{figure*} |
| 334 | |
| 335 | \section{The lksctp Project} |
| 336 | |
| 337 | A critical factor in the success of any new IETF protocol is of course |
| 338 | a Linux implementation. Fortunately, key personnel at Motorola |
| 339 | recognised this and encouraged us to tackle such a project. Months |
| 340 | later, we have a core implementation with an ever-expanding feature |
| 341 | set. We now have significant participation from developers at IBM and |
| 342 | Intel and the pace is picking up. |
| 343 | |
| 344 | \subsection{Coding standards} |
| 345 | |
| 346 | In addition to the usual requirements of kernel code, our code seeks |
| 347 | to be a useful reference for people making their own kernel |
| 348 | implementations of SCTP. If a reader has some question about how to |
| 349 | implement a particular section of the RFC, they need only grep for the |
| 350 | relevant text in our code and they can find an example. As much as |
| 351 | practical, we draw names directly from the RFC. We made the state |
| 352 | machine into an explicit table (see \ref{states} for an |
| 353 | excerpt) with names that refer directly back to the relevant section |
| 354 | numbers. Clarity is a compelling requirement for our code. |
| 355 | |
| 356 | \subsection{Extreme Programming} |
| 357 | |
| 358 | As the project grew and we added developers, we clearly needed some |
| 359 | way of coordinating our work. We decided to experiment with Extreme |
| 360 | Programming, \cite{xp}. |
| 361 | |
| 362 | XP is a collection of practices aimed at controlling risk in a small |
| 363 | to medium-sized software development project. One important principle |
| 364 | is that you should do the simplest thing that could possibly work. A |
| 365 | second important principle is to take advantage of the fact that |
| 366 | programmers like to code. |
| 367 | |
| 368 | We use a range of XP practices, but the practices which are most |
| 369 | visible to anybody who reads or works on lksctp are the tests and the |
| 370 | metaphors. |
| 371 | |
| 372 | \section{The Tests} |
| 373 | |
| 374 | One of the XP practices we use is code-to-the-test. XP asks, ``If |
| 375 | testing is good, why don't we do it all the time?'' Instead of writing |
| 376 | tests for working code, write tests first, and then write code to pass |
| 377 | the tests. This practice leads to a large automated test suite which |
| 378 | runs several times per day. |
| 379 | |
| 380 | We use three kinds of test, unit tests, test frame functional tests, |
| 381 | and live kernel functional tests. |
| 382 | |
| 383 | The most basic form of test is the unit test. Unit tests exercise all |
| 384 | the interfaces of a particular object and confirm that it behaves |
| 385 | correctly. They also encode regression checks for fixed bugs. These |
| 386 | tests all have names beginning with \texttt{test\_}. |
| 387 | |
| 388 | The second form of test is the test frame functional tests. These are |
| 389 | the tests with names beginning with \texttt{ft\_frame\_}. These tests |
| 390 | check for external behaviours of the system, but with a simulated |
| 391 | kernel. The simulated kernel is very light weight and gives us very |
| 392 | fine control over things like timing and network properties. |
| 393 | |
| 394 | Ideally, functional tests should be written by the customer for a |
| 395 | system---they encode the behaviours that the customer expects. In our |
| 396 | case, we play the role of customer on behalf of the RFC. We also use |
| 397 | test frame functional tests to define work items for off-site |
| 398 | development groups. The off-site group writes tests which describe the |
| 399 | feature they intend to implement and submits those tests as a |
| 400 | proposal. This has proven an excellent medium for describing work. |
| 401 | |
| 402 | The final form of test we use is the live kernel functional test. We |
| 403 | have many fewer of these than we would like---they are difficult to run |
| 404 | since we must install and boot a kernel to test. This is much more |
| 405 | work than simply running \texttt{make unit\_test}. We are exploring UML as |
| 406 | a possible way to automate our kernel functional tests. These tests |
| 407 | have names beginning with \texttt{ft\_kern\_}. |
| 408 | |
| 409 | Code-to-the-test is a practice which you can introduce at any point in |
| 410 | a project. When you first start, it seems that you are spending more |
| 411 | time writing tests than writing code, but once you begin to have a |
| 412 | critical mass of interacting tests you begin to see significant |
| 413 | payoffs in both code quality and development velocity. |
| 414 | |
| 415 | We have had several incidents where interactions between unit tests |
| 416 | and functional tests have uncovered complimentary masking bugs. |
| 417 | |
| 418 | Tests are not a substitute for understanding code---they are a |
| 419 | mechanism for encoding that understanding to share with other |
| 420 | developers, including future versions of yourself. You can learn |
| 421 | nearly as much about our code by reading our tests as by reading the |
| 422 | code itself. |
| 423 | |
| 424 | Lately, we have begun using functional tests to encode major bugs. |
| 425 | These are among the best of all possible bug reports---they describe |
| 426 | the failure precisely and tell exactly when the problem is gone. |
| 427 | After the bugs are fixed the tests serve as part of the regression |
| 428 | suite. |
| 429 | |
| 430 | \section{The Metaphors} |
| 431 | |
| 432 | XP projects are built around a unifying metaphor rather than an |
| 433 | elaborate architecture. In our case, we chose two metaphors which |
| 434 | could serve quite well for nearly any protocol development project. |
| 435 | |
| 436 | Our metaphors are the state machine and the smart pipe. Most readers |
| 437 | are probably familiar with the state machine, but the smart pipe is a |
| 438 | twist on a familiar concept. The idea behind a smart pipe\footnote{An |
| 439 | alternate term may be ``oven''.} is that raw stuff goes in one end and |
| 440 | cooked stuff comes out the other end. |
| 441 | |
| 442 | \subsection{The State Machine} |
| 443 | |
| 444 | The state machine in our implementation is quite literal. We have an |
| 445 | explicit state table which keys to specific state functions which are |
| 446 | tied directly back to parts of the RFC. The core of the state machine |
| 447 | (found in \texttt{sctp\_do\_sm()}) is almost purely functional---only header |
| 448 | conversions are permitted. Each state function produces a description |
| 449 | of the side effects (in the form of a \texttt{struct~sctp\_sm\_retval}) |
| 450 | needed to handle the particular event. A separate side effect |
| 451 | processor, \texttt{sctp\_side\_effects()}, converts this structure into |
| 452 | actions. |
| 453 | |
| 454 | Events fall into four categories. The RFC is very explicit about |
| 455 | state transitions associated with arriving chunks. The RFC discusses |
| 456 | transitions due to primitive requests from upper layers, but many of |
| 457 | these are implementation dependent. The third category of events is |
| 458 | timeouts. The final category is a catch-all for odd events like |
| 459 | queues emptying. |
| 460 | |
| 461 | \begin{figure*}[t] |
| 462 | \begin{center} |
| 463 | {\tt |
| 464 | \begin{tabular}{ l l l l l } |
| 465 | \hline |
| 466 | \textnormal{State:} & CLOSED & COOKIE-WAIT & COOKIE-ECHOED & ESTABLISHED \\ |
| 467 | \hline |
| 468 | \hline |
| 469 | \textnormal{Chunks} & & & & \\ |
| 470 | \hline |
| 471 | |
| 472 | INIT & do\_5\_1B\_init & do\_5\_2\_1\_siminit & do\_5\_2\_1\_siminit & do\_5\_2\_2\_dupinit \\ |
| 473 | INIT ACK & discard(5.2.3) & do\_5\_1C\_ack & discard(5.2.3) & discard(5.2.3) \\ |
| 474 | COOKIE ECHO & do\_5\_1D\_ce & do\_5\_2\_4\_dupcook& do\_5\_2\_4\_dupcook& do\_5\_2\_4\_dupcook \\ |
| 475 | COOKIE ACK & discard & discard(5.2.5) & do\_5\_1E\_ca & discard(5.2.5) \\ |
| 476 | DATA & tabort\_8\_4\_8 & discard(6.0) & discard(6.0) & eat\_data\_6\_2 \\ |
| 477 | SACK & tabort\_8\_4\_8 & discard(6.0) & eat\_sack\_6\_2\_1 & eat\_sack\_6\_2\_1 \\ |
| 478 | |
| 479 | \hline |
| 480 | \textnormal{Timeouts} & & & & \\ |
| 481 | \hline |
| 482 | |
| 483 | T1-INIT TO & bug & do\_4\_2\_reinit & bug & bug \\ |
| 484 | T3-RTX TO & bug & bug & do\_6\_3\_3\_retx & do\_6\_3\_3\_retx \\ |
| 485 | |
| 486 | \hline |
| 487 | \textnormal{Primitives} & & & & \\ |
| 488 | \hline |
| 489 | |
| 490 | PRM\_ASSOCIATE & do\_PRM\_ASOC & error & error & error \\ |
| 491 | PRM\_SEND & error & do\_PRM\_SENDQ6.0 & do\_PRM\_SENDQ6.0 & do\_PRM\_SEND \\ |
| 492 | |
| 493 | \end{tabular} |
| 494 | } |
| 495 | \end{center} |
| 496 | \caption{\label{states}Portion of SCTP state table showing association initialisation} |
| 497 | \end{figure*} |
| 498 | |
| 499 | In order to create an explicit state machine, it was necessary to |
| 500 | first create an explicit state table. The process of creating this |
| 501 | table uncovered a few minor contradictions in one of the drafts of the |
| 502 | RFC. These mostly involved conflicting catch-all cases. In Figure 1 |
| 503 | we have an excerpt which shows the state functions involved in |
| 504 | initialising a new association. |
| 505 | |
| 506 | \subsection{The Smart Pipes} |
| 507 | |
| 508 | Each smart pipe has one or more structures which define its internal |
| 509 | data, and a set of functions which define its external interactions. |
| 510 | In this respect these smart pipes can be considered a type of object, |
| 511 | in the OO sense. All of these definitions can be found in the include |
| 512 | file \texttt{<net/sctp/sctpStructs.h>}. |
| 513 | |
| 514 | Most of our smart pipes have push inputs---external objects explictly |
| 515 | put things in by calling methods directly. A pull input is |
| 516 | possible---the smart pipe would need to have a way to register a |
| 517 | callback function which can fetch more input in response to some other |
| 518 | stimulus. |
| 519 | |
| 520 | Some of our pipes use pull outputs. E.g. \texttt{SCTP\_ULPqueue} passes |
| 521 | data and notifications up the protocol stack through explicit calls to |
| 522 | the socket functions, usually \texttt{readmsg(2)}. Some of our smart |
| 523 | pipes use push outputs. E.g. \texttt{SCTP\_outqueue} has a set of |
| 524 | callback functions which it invokes when it needs to send chunks out |
| 525 | toward the wire. |
| 526 | |
| 527 | There are four smart pipes in lksctp. They are |
| 528 | \texttt{SCTP\_inqueue}, \texttt{SCTP\_ULPqueue}, |
| 529 | \texttt{SCTP\_outqueue}, and \texttt{SCTP\_packet}. The first two |
| 530 | carry information up the stack from the wire to the user; the second |
| 531 | two carry information back down the stack. |
| 532 | |
| 533 | \subsubsection{\texttt{SCTP\_inqueue}} |
| 534 | |
| 535 | \texttt{SCTP\_inqueue} accepts packets and provides chunks. It is |
| 536 | responsible for reassembling fragments, unbundling, tracking received |
| 537 | TSN's for acknowledgement, and managing rwnd for congestion control. |
| 538 | There is an \texttt{SCTP\_inqueue} for each endpoint (to handle chunks |
| 539 | not related to a specific association) and one for each association. |
| 540 | |
| 541 | The function \texttt{sctp\_v4\_rcv()} (which is the receiving function |
| 542 | for SCTP registered with IPv4) calls \texttt{sctp\_push\_inqueue()} to |
| 543 | push packets into the input queue for the appropriate association or |
| 544 | endpoint. The function \texttt{sctp\_push\_inqueue()} schedules |
| 545 | either \texttt{sctp\_bh\_rcv\_asoc()} or \texttt{sctp\_bh\_rcv\_ep()} |
| 546 | on the immediate queue to complete delivery. These functions call |
| 547 | \texttt{sctp\_pop\_inqueue()} to pull data out of the |
| 548 | \texttt{SCTP\_inqueue}. This function does most of the work for this |
| 549 | smart pipe. |
| 550 | |
| 551 | The functions \texttt{sctp\_bh\_rcv\_ep()} and |
| 552 | \texttt{sctp\_bh\_rcv\_asoc()} run the state machine on incoming |
| 553 | chunks. Among many other side effects, the state machine can generate |
| 554 | events for an upper-layer-protocol (ULP), and/or chunks to go back out |
| 555 | on the wire. |
| 556 | |
| 557 | \subsubsection{\texttt{SCTP\_ULPqueue}} |
| 558 | |
| 559 | \texttt{SCTP\_ULPqueue} is the smart pipe which accepts events (either |
| 560 | user data messages or notifications) from the state machine and |
| 561 | delivers them to the ULP through the sockets layer. It is responsible |
| 562 | for delivering streams of messages in order. There is one |
| 563 | \texttt{SCTP\_ULPqueue} for every endpoint, but this is likely to |
| 564 | change at some point to one \texttt{SCTP\_ULPqueue} for each socket. |
| 565 | This smart pipe uses a data structure distributed between the |
| 566 | \texttt{struct~SCTP\_endpoint} and the |
| 567 | \texttt{struct~SCTP\_association}. |
| 568 | |
| 569 | The state machine, \texttt{sctp\_do\_sm()}, pushes data into an |
| 570 | \texttt{SCTP\_ULPqueue} by calling |
| 571 | \texttt{sctp\_push\_chunk\_ULPqueue()}. It pushes notifications with |
| 572 | \texttt{sctp\_push\_event\_ULPqueue()}. The sockets layer extracts |
| 573 | events from an \texttt{SCTP\_ULPqueue} with |
| 574 | \texttt{sctp\_pop\_ULPqueue()}. |
| 575 | |
| 576 | \subsubsection{\texttt{SCTP\_outqueue}} |
| 577 | |
| 578 | \texttt{SCTP\_outqueue} is responsible for bundling logic, transport |
| 579 | selection, outbound congestion control, fragmentation, and any |
| 580 | necessary data queueing. It knows whether or not data can go out onto |
| 581 | the wire yet. With one exception noted below, every outbound chunk |
| 582 | goes through an \texttt{SCTP\_outqueue} attached to an association. |
| 583 | The state machine injects chunks into an \texttt{SCTP\_outqueue} with |
| 584 | \texttt{sctp\_push\_outqueue()}. They automatically push out the other |
| 585 | end through a small set of callbacks which are normally attached to an |
| 586 | \texttt{SCTP\_packet}. |
| 587 | |
| 588 | The state machine is capable of putting a fully-formed packet directly |
| 589 | on the wire. At this point only \texttt{ABORT} uses this feature. It is |
| 590 | likely that we will refactor \texttt{INIT ACK} generation again to use |
| 591 | this feature. |
| 592 | |
| 593 | \subsubsection{\texttt{SCTP\_packet}} |
| 594 | |
| 595 | An \texttt{SCTP\_packet} is a lazy packet transmitter associated with a |
| 596 | specific transport. The upper layer pushes data into the packet, |
| 597 | usually with \texttt{sctp\_transmit\_chunk()}. The packet blindly |
| 598 | bundles the chunks. If the it fills (hits the PMTU for its transport), |
| 599 | it transmits the packet to make room for the new chunk. |
| 600 | \texttt{SCTP\_packet} rejects packets which need fragmenting. It is |
| 601 | possible to force a packet to transmit immediately with |
| 602 | \texttt{sctp\_transmit\_packet()}. \texttt{SCTP\_packet} tracks the |
| 603 | congestion counters, but handles none of the congestion logic. |
| 604 | |
| 605 | \section{More Data Structures} |
| 606 | |
| 607 | Not everything is a state table or a smart pipe---after all, this is |
| 608 | the kernel and we ARE programming in C. Here again, we have followed |
| 609 | the RFC very closely. Most of the key concepts in the RFC manifest |
| 610 | themselves as explicit data structures. For convenience, we refer to |
| 611 | these data structures as ``nouns''. |
| 612 | |
| 613 | Nearly all of the ``noun'' structures are designed for use with the |
| 614 | \texttt{sk\_buff} macros for list manipulation. These macros provide a |
| 615 | doubly-linked list with locking. |
| 616 | |
| 617 | \subsection{\texttt{struct~SCTP\_proto}} |
| 618 | |
| 619 | The entire lksctp universe is grounded in an instance of \texttt{ |
| 620 | struct~SCTP\_proto} accessible through \texttt{sctp\_get\_protocol()}. |
| 621 | This structure holds system-wide defaults for things like the maximum |
| 622 | number of permitted retransmissions. It contains a list of all |
| 623 | endpoints on the system. |
| 624 | |
| 625 | \subsection{\texttt{struct~SCTP\_endpoint}} |
| 626 | |
| 627 | Each UDP-style SCTP socket has an endpoint, represented as a |
| 628 | \texttt{struct~SCTP\_endpoint}. Once we implement high-bandwidth sockets and |
| 629 | TCP-style sockets, it will be possible for multiple sockets to share a |
| 630 | single endpoint structure. The endpoint structure contains a local |
| 631 | SCTP socket number and a list of local IP addresses. These two items |
| 632 | define the endpoint uniquely. In addition to endpoint-wide default |
| 633 | values and statistics, the endpoint maintains a list of associations. |
| 634 | |
| 635 | \subsection{\texttt{struct~SCTP\_association}} |
| 636 | |
| 637 | Each association structure, \texttt{struct~SCTP\_association}) is defined |
| 638 | by a local endpoint (a pointer to a \texttt{struct~SCTP\_endpoint}), and |
| 639 | a remote endpoint (an SCTP port number and a list of transport |
| 640 | addresses). This is one of the most complicated structures in the |
| 641 | implementation as it includes a great deal of information mandated by |
| 642 | the RFC. Among many other things, this structure holds the state of |
| 643 | the state machine. The list of transport addresses for the remote |
| 644 | endpoint is more elaborate than the simple list of IP addresses in the |
| 645 | local endpoint data structure since SCTP needs to maintain congestion |
| 646 | information about each of the remote transport addresses. |
| 647 | |
| 648 | \subsection{\texttt{struct~SCTP\_transport}} |
| 649 | |
| 650 | A \texttt{struct~SCTP\_transport} is defined by a remote SCTP port number |
| 651 | and an IP address. The structure holds congestion and reachability |
| 652 | information for the given address. This is also where we get the list |
| 653 | of functions to call to manipulate the specific address family. For |
| 654 | TCP you would find this information way up in the socket, but this is |
| 655 | not possible for SCTP. |
| 656 | |
| 657 | \subsection{\texttt{struct~SCTP\_chunk}} |
| 658 | |
| 659 | Possibly the most fundamental data structure in lksctp is |
| 660 | \texttt{struct~SCTP\_chunk}. This holds SCTP chunks both inbound and |
| 661 | outbound. It is essentially an extension to \texttt{struct~sk\_buff}. |
| 662 | It adds pointers to the various possible SCTP subheaders and a few |
| 663 | flags needed specifically for SCTP. One strict convention is that |
| 664 | \texttt{chunk->skb->data} is the demarcation line between headers in |
| 665 | network byte order and headers in host byte order. All outbound |
| 666 | chunks are ALWAYS in network byte order. The first function which |
| 667 | needs a field from an inbound chunk converts that full header to host |
| 668 | byte order {\it in situ}. |
| 669 | |
| 670 | \section{Acknowledgements} |
| 671 | |
| 672 | The authors are members of a team at Motorola dedicated to producing |
| 673 | open source implementations in support of IETF standardisation. We |
| 674 | would like to thank the people who make these efforts possible, |
| 675 | specifically Maureen~Govern, Stephen~Spear, Qiaobing~Xie, and |
| 676 | Irfan~Ali. We are of course deeply indebted to Randall Stewart and |
| 677 | Qiaobing Xie for having created SCTP and for starting the Linux Kernel |
| 678 | SCTP Implementation Project. We wish to recognizee the ongoing and |
| 679 | significant contributions from developers outside Motorola, especially |
| 680 | Jon Grimm and Daisy Chang of IBM, and Xingang Guo of Intel. |
| 681 | |
| 682 | \section{Availability} |
| 683 | |
| 684 | All the code discussed in this paper is available from the lksctp |
| 685 | project on Source Forge: |
| 686 | |
| 687 | \begin{center} |
| 688 | \texttt{http://sourceforge.net/projects/lksctp/} |
| 689 | \end{center} |
| 690 | |
| 691 | \begin{thebibliography}{2001} |
| 692 | |
| 693 | \bibitem[RFC2960]{rfc2960} R.~Stewart, Q.~Xie, K.~Morneault, C.~Sharp, |
| 694 | H.~J.~Schwarzbauer, T.~Taylor, I.~Rytina, M.~Kalla, L.~Zhang, and, |
| 695 | V.~Paxson, {\em Stream Control Transmission Protocol}, RFC~2960 (Oct~2000). |
| 696 | |
| 697 | \bibitem[SCTPAPI]{api} R.~Stewart, Q.~Xie, L.~H.~P.~Yarroll, J.~Wood, |
| 698 | K.~Poon, K.~Fujita., {\em Sockets API Extensions for SCTP}, Work In |
| 699 | Progress, \texttt{draft-ietf-tsvwg-sctpsocket-00.txt} (Jun~2001). |
| 700 | |
| 701 | \bibitem[SCTPIMPL]{impl} R.~Stewart. {\it et al}, |
| 702 | {\em SCTP Implementor's Guide}, Work In Progress, |
| 703 | \texttt{draft-ietf-tsvwg-sctpimpguide-00.txt} (Jun~2001). |
| 704 | |
| 705 | \bibitem[SCTPMIB]{mib} J.~Pastor, M.~Belinchon. {\em Stream Control |
| 706 | Transmission Protocol Management Information Base using SMIv2}, Work |
| 707 | In Progress, \texttt{draft-ietf-sigtran-sctp-mib-03.txt} (Feb~2001). |
| 708 | |
| 709 | \bibitem[XP]{xp} K.~Beck. {\em Extreme Programming Explained: Embrace |
| 710 | Change}, Addison-Wesley Publishers (2000). |
| 711 | |
| 712 | \bibitem[SCTPORG]{sctporg}{\em Randall Stewart's SCTP site},\\ |
| 713 | \texttt{http://www.sctp.org}, (2001). |
| 714 | |
| 715 | \bibitem[SCTPDE]{sctpde}{\em T\"uxen/Jungmeier SCTP site},\\ |
| 716 | \texttt{http://www.sctp.de}, (2001). |
| 717 | |
| 718 | \end{thebibliography} |
| 719 | |
| 720 | \end{document} |