Avoiding IP Fragmentation: What TCP MSS Does and How It Works
The TCP Maximum Segment Size (MSS) defines the maximum amount of data that a host is willing to accept in a single TCP/IP datagram. This TCP/IP datagram may be fragmented at the IP layer. The MSS value is sent as a TCP header option only in TCP SYN segments. Each side of a TCP connection reports its MSS value to the other side. Contrary to popular belief, the MSS value is not negotiated between hosts. The sending host is required to limit the size of data in a single TCP segment to a value less than or equal to the MSS reported by the receiving host.
The way MSS now works is that each host will first compare its outgoing interface MTU with its own buffer and choose the lowest value as the MSS to send. The hosts will then compare the MSS size received against their own interface MTU and again choose the lower of the two values.
Scenario illustrates this additional step taken by the sender to avoid fragmentation on the local and remote wires. Notice how the MTU of the outgoing interface is taken into account by each host (before the hosts send each other their MSS values) and how this helps to avoid fragmentation.
- Host A compares its MSS buffer (16K) and its MTU (1500 - 40 = 1460) and uses the lower value as the MSS (1460) to send to Host B.
- Host B receives Host A's send MSS (1460) and compares it to the value of its outbound interface MTU - 40 (4422).
- Host B sets the lower value (1460) as the MSS for sending IP datagrams to Host A.
- Host B compares its MSS buffer (8K) and its MTU (4462-40 = 4422) and uses 4422 as the MSS to send to Host A.
- Host A receives Host B's send MSS (4422) and compares it to the value of its outbound interface MTU -40 (1460).
- Host A sets the lower value (1460) as the MSS for sending IP datagrams to Host B.
In Scenario, fragmentation does not occur at the endpoints of a TCP connection because both outgoing interface MTUs are taken into account by the hosts. Packets can still become fragmented in the network between Router A and Router B if they encounter a link with a lower MTU than that of either hosts' outbound interface.
Note: The ip tcp path-mtu-discovery command is used to enable TCP MTU path discovery for TCP connections initiated by routers (BGP and Telnet for example).
Note: MSS value is payload of TCP, so actual packet w/o tunneling, encryption etc will be MSS + TCP header (20 bytes) + IP header (20 bytes)
What Is PMTUD?
TCP MSS as described above takes care of fragmentation at the two endpoints of a TCP connection, but it doesn't handle the case where there is a smaller MTU link in the middle between these two endpoints. PMTUD was developed to avoid fragmentation in the path between the endpoints. It is used to dynamically determine the lowest MTU along the path from a packet's source to its destination.
Note: PMTUD is only supported by TCP. UDP and other protocols do not support it. If PMTUD is enabled on a host, and it almost always is, all TCP/IP packets from the host will have the DF bit set.
When a host sends a full MSS data packet with the DF bit set, PMTUD works by reducing the send MSS value for the connection if it receives information that the packet would require fragmentation. A host usually "remembers" the MTU value for a destination by creating a "host" (/32) entry in its routing table with this MTU value.
If a router tries to forward an IP datagram, with the DF bit set, onto a link that has a lower MTU than the size of the packet, the router will drop the packet and return an Internet Control Message Protocol (ICMP) "Destination Unreachable" message to the source of this IP datagram, with the code indicating "fragmentation needed and DF set" (type 3, code 4). When the source station receives the ICMP message, it will lower the send MSS, and when TCP retransmits the segment, it will use the smaller segment size.
Problems with PMTUD
There are three things that can break PMTUD, two of which are uncommon and one of which is common.
- A router can drop a packet and not send an ICMP message. (Uncommon)
- A router can generate and send an ICMP message but the ICMP message gets blocked by a router or firewall between this router and the sender. (Common)
- A router can generate and send an ICMP message, but the sender ignores the message. (Uncommon)
access-list 101 permit icmp any any unreachable
access-list 101 permit icmp any any time-exceeded
access-list 101 deny icmp any any
access-list 101 permit ip any any
There are other techniques that can be used to help alleviate the problem of ICMP being completely blocked. - Clear the DF bit on the router and allow fragmentation anyway (This may not be a good idea, though).
- Manipulate the TCP MSS option value MSS using the interface command ip tcp adjust-mss 500-1460;.
interface serial0
...
ip policy route-map clear-df-bit
route-map clear-df-bit permit 10
match ip address 111
set ip df 0
access-list 111 permit tcp any any
Another option is to change the TCP MSS option value on SYN packets that traverse the router (available in Cisco IOS 12.2(4)T and later). This reduces the MSS option value in the TCP SYN packet so that it's smaller than the value (1460) in the ip tcp adjust-mss command. The result is that the TCP sender will send segments no larger than this value. The IP packet size will be 40 bytes larger (1500) than the MSS value (1460 bytes) to account for the TCP header (20 bytes) and the IP header (20 bytes). You can adjust the MSS of TCP SYN packets with the ip tcp adjust-mss command. The following syntax will reduce the MSS value on TCP segments to 1460. This command effects traffic both inbound and outbound on interface serial0.
int s0
ip tcp adjust-mss 1460
■ Considerations Regarding Tunnel InterfacesThe following are considerations when tunneling.
- Fast switching of GRE tunnels was introduced in Cisco IOS Release 11.1 and CEF switching was introduced in version 12.0. CEF switching for multipoint GRE tunnels was introduced in version 12.2(8)T. Encapsulation and de-capsulation at tunnel endpoints were slow operations in earlier versions of IOS when only process switching was supported.
- There are security and topology issues when tunneling packets. Tunnels can bypass access control lists (ACLs) and firewalls. If you tunnel through a firewall, you basically bypass the firewall for whatever passenger protocol you are tunneling. Therefore it is recommended to include firewall functionality at the tunnel endpoints to enforce any policy on the passenger protocols.
- Tunneling might create problems with transport protocols that have limited timers (for example, DECnet) because of increased latency
- Tunneling across environments with different speed links, like fast FDDI rings and through slow 9600-bps phone lines, may introduce packet reordering problems. Some passenger protocols function poorly in mixed media networks.
- Point-to-point tunnels can use up the bandwidth on a physical link. If you are running routing protocols over multiple point-to-point tunnels, keep in mind that each tunnel interface has a bandwidth and that the physical interface over which the tunnel runs has a bandwidth. For example, you would want to set the tunnel bandwidth to 100 Kb if there were 100 tunnels running over a 10 Mb link. The default bandwidth for a tunnel is 9Kb.
- Routing protocols may prefer a tunnel over a "real" link because the tunnel might deceptively appear to be a one-hop link with the lowest cost path, although it actually involves more hops and is really more costly than another path. This can be mitigated with proper configuration of the routing protocol. You might want to consider running a different routing protocol over the tunnel interface than the routing protocol running on the physical interface.
- Problems with recursive routing can be avoided by configuring appropriate static routes to the tunnel destination. A recursive route is when the best path to the "tunnel destination" is through the tunnel itself. This situation will cause the tunnel interface to bounce up and down. You will see the following error when there is a recursive routing problem:
- %TUN-RECURDOWN Interface Tunnel 0 temporarily disabled due to recursive routing
The next example shows what happens when the router is acting in the role of a sending host with respect to PMTUD and in regards to the tunnel IP packet. This time the DF bit is set (DF = 1) in the original IP header and we have configured the tunnel path-mtu-discovery command so that the DF bit will be copied from the inner IP header to the outer (GRE + IP) header.
Example 4
- The forwarding router at the tunnel source receives a 1476-byte datagram with DF = 1 from the sending host.
IP 1456 bytes TCP + data - This router encapsulates the 1476-byte IP datagram inside GRE to get a 1500-byte GRE IP datagram. This GRE IP header will have the DF bit set (DF = 1) since the original IP datagram had the DF bit set. This router then forwards this packet to the tunnel destination.
IP GRE IP 1456 bytes TCP - Again, assume there is a router between the tunnel source and destination with a link MTU of 1400. This router will not fragment the tunnel packet since the DF bit is set (DF = 1). This router must drop the packet and send an ICMP error message to the tunnel source router, since that is the source IP address on the packet.
IP ICMP MTU 1400 - The forwarding router at the tunnel source receives this ICMP error message and it will lower the GRE tunnel IP MTU to 1376 (1400 - 24). The next time the sending host retransmits the data in a 1476-byte IP packet, this packet will be too large and this router will send an ICMP error message to the sender with a MTU value of 1376. When the sending host retransmits the data, it will send it in a 1376-byte IP packet and this packet will make it through the GRE tunnel to the receiving host.
Very nice article, you make this much easier to understand than Cisco.
ReplyDelete