Cisco Bug CSCwh80060 – APs may stop handling client traffic

Introduction

For some time now, while observing various Wi-Fi environments with a total of about 1500 access points, I have noticed that some of them “stopped functioning properly”. Clients connected to specific access points reported problems with service availability, sometimes regardless of the type of SSID they were connected to. The first conversation with Cisco TAC in August 2023 did not bring a solution. I reported that clients connected to malfunctioning access points were being locally assigned to the access point’s management VLAN. I will note that all the analyzed access points are in FlexConnect mode, and the traffic from all SSIDs is switched locally. The only thing I was able to ascertain was that restarting the CAPWAP tunnel helped restore the service. The issue quieted down for a while. I should note that at that time I was working with software version 17.3.X.

In the meantime, the infrastructure management group updated the software, and the problem began to reappear. I started actively searching for an answer to what might be the cause.
After a longer period of analyzing the problem, I noticed that all the people who reported the issue were connected to access points whose radio (slot 0) was operating in the Monitor role, set as a result of the Flexible Radio Assignment (FRA) operation. This was a bullseye because I then observed that the access points, as a result of changing their role to Monitor due to FRA, lost their mapping, meaning the assignment of appropriate WLANs to VLANs. This explained a lot, particularly the problems I experienced in August.
I immediately reported the matter again to Cisco along with my observations, and ultimately it seems that such a situation, as I described above, can indeed occur. I also received information about a registered bug that most likely relates to my problem.

CSCwh80060 – COS APs connected with 9800 WLC losing Flex WLAN VLAN mapping

How did I come to this conclusion?

I began analyzing various commands and outputs from access points, and two became quite significant.

AP#show controllers dot11Radio 0 vlan
   
AP#show controllers dot11Radio 1 vlan

AP#show flexconnect wlan

The first two commands show the mapping of each SSID to its VLAN, while the third command shows, among other things, which SSIDs are available on which radio and whether each SSID is switched centrally or locally. By correlating the output of these commands on APs that were reported as malfunctioning, I came to the conclusion that clients were not being mapped to the appropriate VLANs as per the configuration. Additionally, the last command showed that radio 0 was in Monitor mode, which is why the first command did not show any SSID-to-VLAN mapping.

Examples of outputs when radio slot 0 serves as a monitor.

AP#show controllers dot11Radio 0 vlan
Vlan BSSID Pri/U/M EncryPolicy Key0 Key1 Key2 Key3 iGTK SSIDs MFP

# No entries are listed

AP#show controllers dot11Radio 1 vlan
Vlan BSSID Pri/U/M EncryPolicy Key0 Key1 Key2 Key3 iGTK            SSIDs MFP
   0  58AE   6 6 6 AES_CCM128       x128            DIS             SSID1   0
   0  58AD   6 6 6 AES_CCM128       x128            DIS             SSID2   0

AP#show flexconnect wlan
Flexconnect WLANs:
Radio 0 is in Monitor mode
Radio Vap             SSID State    Auth Assoc Switching
    1   0                   DOWN Central Local   Central
    1   1            SSID1    UP Central Local     Local
    1   2            SSID2    UP Central Local     Local
    1   3                   DOWN Central Local   Central
    1   4                   DOWN Central Local   Central
    1   5                   DOWN Central Local   Central
    1   6                   DOWN Central Local   Central
    1   7                   DOWN Central Local   Central
    1   8                   DOWN Central Local   Central
    1   9                   DOWN Central Local   Central
    1  10                   DOWN Central Local   Central
    1  11                   DOWN Central Local   Central
    1  12                   DOWN Central Local   Central
    1  13                   DOWN Central Local   Central
    1  14                   DOWN Central Local   Central
    1  15                   DOWN Central Local   Central
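
The script described later in this post relies on TextFSM templates to parse this output. Purely as an illustration of what that parsing step produces, here is a dependency-free sketch based on regular expressions. The names parse_flexconnect_wlan and radio0_in_monitor_mode and the dictionary keys are my own assumptions and not part of the script, and SSIDs containing spaces are not handled.

```python
import re

# Matches one data row of "show flexconnect wlan"; the SSID column may be empty.
ROW_PATTERN = re.compile(
    r"^\s*(?P<radio>\d+)\s+(?P<vap>\d+)\s+(?P<ssid>\S+)?\s*"
    r"(?P<state>UP|DOWN)\s+(?P<auth>\S+)\s+(?P<assoc>\S+)\s+"
    r"(?P<switching>Central|Local)\s*$"
)


def parse_flexconnect_wlan(output):
    """Return one dict per WLAN row; header and banner lines are skipped."""
    rows = []
    for line in output.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            rows.append({
                "radio": int(match["radio"]),
                "vap": int(match["vap"]),
                "ssid": match["ssid"] or "",  # empty for unused VAP slots
                "state": match["state"],
                "switching": match["switching"],
            })
    return rows


def radio0_in_monitor_mode(output):
    """True when the AP reports that slot 0 is in Monitor mode."""
    return "Radio 0 is in Monitor mode" in output


# Shortened copy of the output shown above
SAMPLE_OUTPUT = """Flexconnect WLANs:
Radio 0 is in Monitor mode
Radio Vap             SSID State    Auth Assoc Switching
    1   0                   DOWN Central Local   Central
    1   1            SSID1    UP Central Local     Local
    1   2            SSID2    UP Central Local     Local
"""

for row in parse_flexconnect_wlan(SAMPLE_OUTPUT):
    print(row)
```

Each row becomes a dictionary, so the switching type of an SSID can later be looked up by name.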


I should add that the mentioned bug suggests that when an access point loses its mapping, the command ‘show controllers nss status’ should show a value of true for ‘is_central_switching’.
Unfortunately, in my case, the problem occurred even when this value was correctly set to false. Therefore, based on the above commands, I found the following method of verification.

  • First, I check whether any SSID-to-VLAN mapping has a value of 0. Important to note!
    Such a mapping is correct for SSIDs that are switched centrally.
AP#show controllers dot11Radio 1 vlan
  • Then, I check whether a given SSID is switched locally or centrally using this command:
AP#show flexconnect wlan

If the first command shows that an SSID is mapped to VLAN 0 and that SSID is centrally switched, the mapping is correct. However, if a locally switched SSID is mapped to VLAN 0, it indicates a problem with the SSID-to-VLAN mapping.
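
The rule above can be sketched as a small function. This is only an illustration under my own assumptions: the name find_suspect_ssids, the row keys (vlan, ssid, switching) and the GUEST SSID are hypothetical, and the script's own helper that performs this correlation is not shown in this post.

```python
def find_suspect_ssids(vlan_rows, flexconnect_rows):
    """Return SSIDs that are locally switched but mapped to VLAN 0."""
    # Switching type per SSID, ignoring unused VAP slots with an empty name
    switching_by_ssid = {
        row["ssid"]: row["switching"]
        for row in flexconnect_rows
        if row["ssid"]
    }
    suspects = []
    for row in vlan_rows:
        mapped_to_vlan_0 = int(row["vlan"]) == 0
        locally_switched = switching_by_ssid.get(row["ssid"]) == "Local"
        if mapped_to_vlan_0 and locally_switched:
            suspects.append(row["ssid"])
    return suspects


# Rows modelled on the outputs shown earlier; GUEST is a made-up,
# centrally switched SSID added to show the valid VLAN 0 case.
vlan_rows = [
    {"vlan": "0", "ssid": "SSID1"},
    {"vlan": "0", "ssid": "SSID2"},
    {"vlan": "0", "ssid": "GUEST"},
]
flexconnect_rows = [
    {"ssid": "SSID1", "switching": "Local"},
    {"ssid": "SSID2", "switching": "Local"},
    {"ssid": "GUEST", "switching": "Central"},
    {"ssid": "", "switching": "Central"},
]

print(find_suspect_ssids(vlan_rows, flexconnect_rows))
```

In this sample, SSID1 and SSID2 are flagged because they are locally switched yet mapped to VLAN 0, while GUEST is not, since VLAN 0 is expected for central switching.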

How to simplify the analysis process?

Since I only notice the problem on APs with slot 0 operating in Monitor mode, we can filter out such APs and then, logging in one by one, check if we are experiencing this issue.
Unfortunately, for me and the group managing thousands of access points, this can be quite time-consuming and not very efficient. Therefore, I have prepared a script that will do it for us.
Additionally, we can choose the option to check all APs because it may turn out that for some other reasons, such mapping may be lost on APs, not just those where slot 0 is in Monitor role (I personally haven’t encountered this).

Here’s what we need to use the script I’ll describe briefly later in this post:

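In addition, it is necessary to ensure the availability of the following libraries (netmiko and textfsm have to be installed separately, while the others are part of the Python standard library):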

  • netmiko
  • time
  • datetime
  • json
  • textfsm
  • sys

I won’t include descriptions of modules in this post, and I’ll focus only on the main code. If someone is interested in details of the imported modules: connecthandlerc9800 and parsedtextfsmc9800, I encourage them to read the paragraph “Modules description” in the following post: Cisco WLC 9800 Python Script: devices with the APIPA address
The same applies to the explanation of what TextFSM templates are, which I clarify in the paragraph “TextFSM template description” of the same post: Cisco WLC 9800 Python Script: devices with the APIPA address

CSCwh80060 script description

I’ll start with the fact that the script provides for the analysis of APs from the perspective of a bug in two ways:

  • A faster and more efficient option (response 1 in the code), which analyzes only access points whose slot 0 is in Monitor mode.
  • A slower option (response 2 in the code), which checks all APs regardless of whether slot 0 is in Monitor mode. It takes significantly longer: based on tests with about 1000 access points, we can expect the code execution to take about 1.5 hours. This depends on several factors, but it’s worth being aware of.
""" This part of the script defines whether it should be executed for all APs or only for those that have the Monitor role set for slot 0.  """

answear = input("\nXOR APs with Monitor role: type 1 (default)\nAll APs: type 2\nWhich APs do you want to check: ")

if answear == '':
    answear = '1'
elif answear == '2':
    message = "Please note: This script may take a while to complete. Thank you for your patience!"
    message_length = len(message)
    border = "#" * (message_length + 4)

    print(border)
    print("#" + " " * (message_length + 2) + "#")
    print("# " + message + " #")
    print("#" + " " * (message_length + 2) + "#")
    print(border)

elif answear not in ['1', '2']:
    print("You typed something wrong! Script is being terminated!")
    sys.exit()

Another important assumption of the code is that all access points have the same username and password. Otherwise, an exception may occur, and such an access point will be marked in the code as one that cannot be logged into. The same happens if a firewall blocks SSH traffic to a given access point.

def aps_with_no_access(check_ap):
    """ Checking the connection availability of the access point """
    # True when the AP is already on the list of APs that could not be reached
    return any(
        ap['IP_Address'] == check_ap['IP_Address']
        for ap in list_of_aps_with_no_access
    )

Let’s now focus on these two options we can choose from and what happens in the respective parts of the code.


When we choose the first option, we must first build a list of access points whose slot 0 is in Monitor mode.

if answear == '1':
    """ Filtering APs in Monitor mode """

    send_command_ap_monitor_mode = 'show ap dot11 dual-band summary extended | i Monitor'
    ap_dot11_dual_band_summary_extended_template = 'wlc_c9800_show_ap_dot11_dual-band_summary_extended'

    command_output_ap_monitor_mode = connect_to_wlc(wlc, send_command_ap_monitor_mode)
    parsed_command_output_ap_monitor_mode = parse_and_process_data(ap_dot11_dual_band_summary_extended_template, command_output_ap_monitor_mode)

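    # Names of APs whose XOR radio (slot 0) is currently in the Monitor role
    monitor_ap_names = {entry['AP_Name'] for entry in parsed_command_output_ap_monitor_mode}

    # Keep only those APs; the set lookup avoids a nested loop and duplicate entries
    filtered_ap_list_name_ip = [
        {'AP_Name': entry['AP_Name'], 'IP_Address': entry['IP_Address']}
        for entry in ap_list_name_ip
        if entry['AP_Name'] in monitor_ap_names
    ]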

Then, we use the created list to log into each AP and check only the SSID to VLAN mapping for slot 1.

    # Check whether APs with the XOR radio in Monitor mode are experiencing the problem.
    # This is the first and fundamental way of verifying whether we are encountering the bug.
    # No such behavior has been observed so far on APs whose radios were not operating in
    # Monitor mode. Just in case, the second part of the code (answear == '2') can also
    # check whether the bug is present on other APs.

    for ap in filtered_ap_list_name_ip:
        current_ap += 1
        # Report progress against the list that is actually being iterated
        progress_percentage(current_ap, len(filtered_ap_list_name_ip))

        # Skip APs that are already known to be unreachable
        if aps_with_no_access(ap):
            continue

        command_output_controllers_dot11Radio_1_vlan = execute_command_on_ap(ap, send_command_controllers_dot11Radio_1_vlan)
        if command_output_controllers_dot11Radio_1_vlan is None:
            continue

Finally, it’s just a matter of correlating the information obtained with the type of switching on the SSID: central or local. We eliminate from the list those SSIDs that are switched centrally, as it’s normal for SSID to VLAN mapping to have a value of 0. However, for locally switched SSIDs, such mapping suggests a problem and the potential for encountering a bug.

        parsed_command_output_controllers_dot11Radio_1_vlan = parse_and_process_data(controllers_dot11Radio_slot_vlan_template,command_output_controllers_dot11Radio_1_vlan)

        # Nothing to correlate when the slot reports no SSID-to-VLAN entries
        if len(parsed_command_output_controllers_dot11Radio_1_vlan) == 0:
            continue

        # Get the switching type of each SSID and correlate it with the VLAN mapping
        command_output_flexconnect_wlan = execute_command_on_ap(ap, send_command_flexconnect_wlan)
        parsed_command_output_flexconnect_wlan = parse_and_process_data(flexconnect_wlan_template,command_output_flexconnect_wlan)
        if affected_ap(parsed_command_output_controllers_dot11Radio_1_vlan, parsed_command_output_flexconnect_wlan) is True:
            affected_ap_list_name_ip.append(ap)


On the other hand, when we choose the second option, we must perform the entire procedure for all access points.

elif answear == '2':
    """ Checking all access points for the occurrence of the problem. """

    for ap in ap_list_name_ip:
        current_ap += 1
        progress_percentage(current_ap, len(ap_list_name_ip))

        # Skip APs that are already known to be unreachable
        if aps_with_no_access(ap):
            continue

        command_output_controllers_dot11Radio_0_vlan = execute_command_on_ap(ap, send_command_controllers_dot11Radio_0_vlan)
        if command_output_controllers_dot11Radio_0_vlan is None:
            continue
        parsed_command_output_controllers_dot11Radio_0_vlan = parse_and_process_data(controllers_dot11Radio_slot_vlan_template,command_output_controllers_dot11Radio_0_vlan)

        command_output_controllers_dot11Radio_1_vlan = execute_command_on_ap(ap,send_command_controllers_dot11Radio_1_vlan)
        # Same guard as for slot 0: no output means there is nothing to parse
        if command_output_controllers_dot11Radio_1_vlan is None:
            continue
        parsed_command_output_controllers_dot11Radio_1_vlan = parse_and_process_data(controllers_dot11Radio_slot_vlan_template,command_output_controllers_dot11Radio_1_vlan)


We also check and correlate the information received from the two radio slots of the access point.

        # Keep only the slots that report SSID-to-VLAN entries. An empty table for
        # slot 0 (for example, in Monitor role) must not prevent slot 1 from being checked.
        parsed_vlan_outputs = [
            parsed_output
            for parsed_output in (parsed_command_output_controllers_dot11Radio_0_vlan,
                                  parsed_command_output_controllers_dot11Radio_1_vlan)
            if len(parsed_output) > 0
        ]
        if not parsed_vlan_outputs:
            continue

        # The switching type of each SSID is fetched once and reused for both slots
        command_output_flexconnect_wlan = execute_command_on_ap(ap, send_command_flexconnect_wlan)
        parsed_command_output_flexconnect_wlan = parse_and_process_data(flexconnect_wlan_template,command_output_flexconnect_wlan)

        # The AP is added once if either slot shows the faulty mapping
        if any(affected_ap(parsed_output, parsed_command_output_flexconnect_wlan) is True
               for parsed_output in parsed_vlan_outputs):
            affected_ap_list_name_ip.append(ap)

In the end, the code saves the results to a JSON file, along with a second file listing the access points that we could not log in to for various reasons. The code also reports the time required to execute the script and, when checking all access points, displays the elapsed time after every 10% of the APs in the list have been processed.

# Saving the results to a file.

end_time = datetime.datetime.now()

elapsed_time = end_time - start_time
formatted_time = format_duration(elapsed_time)

timestamp = end_time.strftime("%d_%b_%Y_%H_%M_%S")

aps_with_no_access_json = 'aps_with_no_access_json_'+ str(wlc['host'] + f'_date_{timestamp}')

with open(aps_with_no_access_json, 'wt') as file:
    json.dump(list_of_aps_with_no_access, file, indent = 4)

affected_aps_json = 'affected_aps_json_wlc_' + str(wlc['host'] + f'_date_{timestamp}')

with open(affected_aps_json,'wt') as file:
    json.dump(affected_ap_list_name_ip,file, indent=4)

print("\nCode execution ended. Time passed: ", formatted_time, '\n')
print(f'File created and saved for affected APs by CSCwh80060 bug: {affected_aps_json}')
print(f'File created with APs not checked because of connection issues: {aps_with_no_access_json}')
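
The helpers format_duration and progress_percentage are called in the snippets above, but their bodies are not shown in this post. As a rough sketch of what such helpers could look like, assuming format_duration receives a datetime.timedelta and returns a string, consider the following. The function progress_milestone is a hypothetical stand-in for the 10% reporting logic, not the script's actual progress_percentage.

```python
import datetime


def format_duration(elapsed):
    """Format a timedelta as HH:MM:SS (assumed output format)."""
    total_seconds = int(elapsed.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def progress_milestone(current, total, step=10):
    """Return the percentage milestone reached by this item, or None.

    A milestone is reported only when processing item number `current`
    crosses a multiple of `step` percent of `total`.
    """
    if total <= 0:
        return None
    previous = ((current - 1) * 100 // total) // step
    reached = (current * 100 // total) // step
    return reached * step if reached > previous else None


start_time = datetime.datetime.now()
total_aps = 20  # made-up number of APs for the demonstration
for current_ap in range(1, total_aps + 1):
    milestone = progress_milestone(current_ap, total_aps)
    if milestone is not None:
        elapsed = datetime.datetime.now() - start_time
        print(f"{milestone}% of APs processed, time passed: {format_duration(elapsed)}")
```

Reporting only when a milestone is crossed keeps the console output short even for lists with thousands of APs.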


Finally, one very important note. The code is adapted for Cisco access points with two radio slots, such as the 9120. It should also work with other APs, but I suggest adapting the relevant part of the code so that it meets your specific requirements.

What’s next?

If we already have a list of access points that are affected by this bug, we can do three things:

  • Restart the AP, which will temporarily solve the problem (you can first reset the CAPWAP tunnel, which sometimes also achieves the desired effect and is faster).
  • Switch slot 0 from automatic mode to Client Serving mode by selecting the appropriate frequency.
  • If the problem is significant, disable FRA (Flexible Radio Assignment) globally, remembering that the roles of the APs will remain as they were at the time of disabling. However, after an AP restarts, its slot 0 will automatically change to Client Serving 2.4GHz.


The solution depends on the specific situation in the given wireless network environment.

Summary


Undoubtedly, this problem is quite troublesome as it can occur from time to time at different access points, which can significantly affect clients who roam between APs.
I have noted this problem only in connection with the operation of FRA (Flexible Radio Assignment). Therefore, everyone who has APs with XOR radio and FRA enabled and functioning should pay attention to this bug.
