Technical thoughts: Memento

■ MTU and MSS. While establishing new TCP connection, nodes define MSS - maximum segment size, i.e. maximum size of TCP payload in IP packet.The TCP MSS Adjustment feature enables the configuration of the maximum segment size (MSS) for transient packets that traverse a router, specifically TCP segments in the SYN bit set. PPPoE truncates the Ethernet maximum transmission unit (MTU) 1492, and if the effective MTU on the hosts (PCs) is not changed, the router in between the host and the server can terminate the TCP sessions. It also true for GRE tunnels, etc. For example, if you use an IP (generic routing encapsulation (GRE)) tunnel, the MTU of the tunnel interface is less than that of the corresponding physical interface. If MTU discovery fails due to ICMP filtering or host stack problems, then large packets are unable to traverse the tunnel. Workaround is to increase the tunnel MTU size to 1500. The ip tcp adjust-mss command specifies the MSS value on the intermediate router of the SYN packets to avoid truncation. MSS value shouldn't be too large because the different headers will be prepended to the segment and overall packet could be easily larger than MTU. At the same time, it shouldn't be too small, because the unnecessary overhead will be increased (the ratio between headers vs. user data in a single packet will increase).
Avoiding IP Fragmentation: What TCP MSS Does and How It Works
The TCP Maximum Segment Size (MSS) defines the maximum amount of data that a host is willing to accept in a single TCP/IP datagram. This TCP/IP datagram may be fragmented at the IP layer. The MSS value is sent as a TCP header option only in TCP SYN segments. Each side of a TCP connection reports its MSS value to the other side. Contrary to popular belief, the MSS value is not negotiated between hosts. The sending host is required to limit the size of data in a single TCP segment to a value less than or equal to the MSS reported by the receiving host.
The way MSS now works is that each host will first compare its outgoing interface MTU with its own buffer and choose the lowest value as the MSS to send. The hosts will then compare the MSS size received against their own interface MTU and again choose the lower of the two values.
Scenario illustrates this additional step taken by the sender to avoid fragmentation on the local and remote wires. Notice how the MTU of the outgoing interface is taken into account by each host (before the hosts send each other their MSS values) and how this helps to avoid fragmentation.

Host A compares its MSS buffer (16K) and its MTU (1500 - 40 = 1460) and uses the lower value as the MSS (1460) to send to Host B.

Host B receives Host A's send MSS (1460) and compares it to the value of its outbound interface MTU - 40 (4422).

Host B sets the lower value (1460) as the MSS for sending IP datagrams to Host A.

Host B compares its MSS buffer (8K) and its MTU (4462-40 = 4422) and uses 4422 as the MSS to send to Host A.

Host A receives Host B's send MSS (4422) and compares it to the value of its outbound interface MTU -40 (1460).

Host A sets the lower value (1460) as the MSS for sending IP datagrams to Host B.

1460 is the value chosen by both hosts as the send MSS for each other. Often the send MSS value will be the same on each end of a TCP connection.
In Scenario, fragmentation does not occur at the endpoints of a TCP connection because both outgoing interface MTUs are taken into account by the hosts. Packets can still become fragmented in the network between Router A and Router B if they encounter a link with a lower MTU than that of either hosts' outbound interface.
Note: The ip tcp path-mtu-discovery command is used to enable TCP MTU path discovery for TCP connections initiated by routers (BGP and Telnet for example).
Note: MSS value is payload of TCP, so actual packet w/o tunneling, encryption etc will be MSS + TCP header (20 bytes) + IP header (20 bytes)
What Is PMTUD?
TCP MSS as described above takes care of fragmentation at the two endpoints of a TCP connection, but it doesn't handle the case where there is a smaller MTU link in the middle between these two endpoints. PMTUD was developed to avoid fragmentation in the path between the endpoints. It is used to dynamically determine the lowest MTU along the path from a packet's source to its destination.

Note: PMTUD is only supported by TCP. UDP and other protocols do not support it. If PMTUD is enabled on a host, and it almost always is, all TCP/IP packets from the host will have the DF bit set.
When a host sends a full MSS data packet with the DF bit set, PMTUD works by reducing the send MSS value for the connection if it receives information that the packet would require fragmentation. A host usually "remembers" the MTU value for a destination by creating a "host" (/32) entry in its routing table with this MTU value.
If a router tries to forward an IP datagram, with the DF bit set, onto a link that has a lower MTU than the size of the packet, the router will drop the packet and return an Internet Control Message Protocol (ICMP) "Destination Unreachable" message to the source of this IP datagram, with the code indicating "fragmentation needed and DF set" (type 3, code 4). When the source station receives the ICMP message, it will lower the send MSS, and when TCP retransmits the segment, it will use the smaller segment size.
Problems with PMTUD
There are three things that can break PMTUD, two of which are uncommon and one of which is common.

A router can drop a packet and not send an ICMP message. (Uncommon)

A router can generate and send an ICMP message but the ICMP message gets blocked by a router or firewall between this router and the sender. (Common)

A router can generate and send an ICMP message, but the sender ignores the message. (Uncommon)

The first and last of the three bullets above are uncommon and are usually the result of an error, but the middle bullet describes a common problem. People that implement ICMP packet filters tend to block all ICMP message types rather than only blocking certain ICMP message types. A packet filter can block all ICMP message types except those that are "unreachable" or "time-exceeded." The success or failure of PMTUD hinges upon ICMP unreachable messages getting through to the sender of a TCP/IP packet. ICMP time-exceeded messages are important for other IP issues. An example of such a packet filter, implemented on a router is shown below.

access-list 101 permit icmp any any unreachable
access-list 101 permit icmp any any time-exceeded
access-list 101 deny icmp any any
access-list 101 permit ip any any

There are other techniques that can be used to help alleviate the problem of ICMP being completely blocked.

Clear the DF bit on the router and allow fragmentation anyway (This may not be a good idea, though).

Manipulate the TCP MSS option value MSS using the interface command ip tcp adjust-mss 500-1460;.

For example, Router A and Router B are in the same administrative domain. Router C is inaccessible and is blocking ICMP, so PMTUD is broken. A workaround for this situation is to clear the DF bit in both directions on Router B to allow fragmentation. This can be done using policy routing. The syntax to clear the DF bit is available in Cisco IOS® Software Release 12.1(6) and later.

interface serial0 
... 
ip policy route-map clear-df-bit 
route-map clear-df-bit permit 10 
	match ip address 111 
	set ip df 0 
 
access-list 111 permit tcp any any

Another option is to change the TCP MSS option value on SYN packets that traverse the router (available in Cisco IOS 12.2(4)T and later). This reduces the MSS option value in the TCP SYN packet so that it's smaller than the value (1460) in the ip tcp adjust-mss command. The result is that the TCP sender will send segments no larger than this value. The IP packet size will be 40 bytes larger (1500) than the MSS value (1460 bytes) to account for the TCP header (20 bytes) and the IP header (20 bytes).
You can adjust the MSS of TCP SYN packets with the ip tcp adjust-mss command. The following syntax will reduce the MSS value on TCP segments to 1460. This command effects traffic both inbound and outbound on interface serial0.

int s0 
ip tcp adjust-mss 1460

■ Considerations Regarding Tunnel Interfaces

The following are considerations when tunneling.

Fast switching of GRE tunnels was introduced in Cisco IOS Release 11.1 and CEF switching was introduced in version 12.0. CEF switching for multipoint GRE tunnels was introduced in version 12.2(8)T. Encapsulation and de-capsulation at tunnel endpoints were slow operations in earlier versions of IOS when only process switching was supported.

There are security and topology issues when tunneling packets. Tunnels can bypass access control lists (ACLs) and firewalls. If you tunnel through a firewall, you basically bypass the firewall for whatever passenger protocol you are tunneling. Therefore it is recommended to include firewall functionality at the tunnel endpoints to enforce any policy on the passenger protocols.

Tunneling might create problems with transport protocols that have limited timers (for example, DECnet) because of increased latency

Tunneling across environments with different speed links, like fast FDDI rings and through slow 9600-bps phone lines, may introduce packet reordering problems. Some passenger protocols function poorly in mixed media networks.

Point-to-point tunnels can use up the bandwidth on a physical link. If you are running routing protocols over multiple point-to-point tunnels, keep in mind that each tunnel interface has a bandwidth and that the physical interface over which the tunnel runs has a bandwidth. For example, you would want to set the tunnel bandwidth to 100 Kb if there were 100 tunnels running over a 10 Mb link. The default bandwidth for a tunnel is 9Kb.

Routing protocols may prefer a tunnel over a "real" link because the tunnel might deceptively appear to be a one-hop link with the lowest cost path, although it actually involves more hops and is really more costly than another path. This can be mitigated with proper configuration of the routing protocol. You might want to consider running a different routing protocol over the tunnel interface than the routing protocol running on the physical interface.

Problems with recursive routing can be avoided by configuring appropriate static routes to the tunnel destination. A recursive route is when the best path to the "tunnel destination" is through the tunnel itself. This situation will cause the tunnel interface to bounce up and down. You will see the following error when there is a recursive routing problem:

%TUN-RECURDOWN Interface Tunnel 0 temporarily disabled due to recursive routing

Note: By default a router doesn't do PMTUD on the GRE tunnel packets that it generates. The tunnel path-mtu-discovery command can be used to turn on PMTUD for GRE-IP tunnel packets.

The next example shows what happens when the router is acting in the role of a sending host with respect to PMTUD and in regards to the tunnel IP packet. This time the DF bit is set (DF = 1) in the original IP header and we have configured the tunnel path-mtu-discovery command so that the DF bit will be copied from the inner IP header to the outer (GRE + IP) header.
Example 4

The forwarding router at the tunnel source receives a 1476-byte datagram with DF = 1 from the sending host.

IP	1456 bytes TCP + data

This router encapsulates the 1476-byte IP datagram inside GRE to get a 1500-byte GRE IP datagram. This GRE IP header will have the DF bit set (DF = 1) since the original IP datagram had the DF bit set. This router then forwards this packet to the tunnel destination.

GRE

1456 bytes TCP

Again, assume there is a router between the tunnel source and destination with a link MTU of 1400. This router will not fragment the tunnel packet since the DF bit is set (DF = 1). This router must drop the packet and send an ICMP error message to the tunnel source router, since that is the source IP address on the packet.

IP	ICMP MTU 1400

The forwarding router at the tunnel source receives this ICMP error message and it will lower the GRE tunnel IP MTU to 1376 (1400 - 24). The next time the sending host retransmits the data in a 1476-byte IP packet, this packet will be too large and this router will send an ICMP error message to the sender with a MTU value of 1376. When the sending host retransmits the data, it will send it in a 1376-byte IP packet and this packet will make it through the GRE tunnel to the receiving host.

Memento

1 comment: