Windows 10 and Zscaler – What can go wrong?

Who here works in a company that uses Zscaler’s proxy on devices running Windows 10?

Raise your hand.

Great. Now, raise your hand if you’ve experienced wireless network issues in such an environment.

Awesome. A quick check and you’ll know if you should delve deeper into this topic, for instance by reading my post.

I’m not kidding; this is a serious matter. Everyone responsible for the service, from either a business or an operational perspective, needs to know this!

But what’s happening?

Some time ago, after an update to Windows 10 systems, clients began to complain that they were being ‘kicked’ from the network. They noticed that after arriving at the office it took about 5 minutes for the connection to stabilize so that they could work freely. Until then, the wireless card would literally disappear and reappear in the taskbar next to the clock. To make matters worse, they were also disconnected from time to time during the day.

A bit of detail

Let’s first focus on the device with the Windows 10 system. Every time we connect to a “network”, our devices check whether this process has been successful and whether we have internet access. For this purpose, Windows uses a service called Network Connectivity Status Indicator (NCSI), which is part of another service, Network Location Awareness (NLAsvc).

To check internet access, NCSI uses two mechanisms, the “active probe” and the “passive probe,” which operate independently. They are triggered every time the state of a network adapter changes (a gateway is added to or removed from the adapter, a VPN tunnel is established, a Wi-Fi connection is made, or the NLAsvc service is restarted). The visible effect is a message such as “No connection” or “Limited internet access” when hovering over the Wi-Fi icon. Until the process completes successfully, we may see a “globe” icon instead of the Wi-Fi or computer icon.

How does the NCSI “active probe” process work?

For Windows 10 and newer systems, it looks as follows:

  1. NCSI sends a DNS request to resolve the FQDN www.msftconnecttest.com.
  2. If NCSI receives a valid response from the DNS server, it sends a standard HTTP GET request to http://www.msftconnecttest.com/connecttest.txt.
  3. If NCSI successfully downloads the text file, it checks that the file contains the text “Microsoft Connect Test”.
  4. NCSI sends another DNS request to resolve the FQDN dns.msftncsi.com.

If any of these requests fail, a network alert will appear on the taskbar (as described in the issue). If all of these requests succeed, a regular network icon will appear on the taskbar.
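The four steps above can be approximated from any machine to see which one fails behind a proxy. The following is only a rough sketch in Python, not how NCSI is implemented internally; the function names and the 5-second timeout are my own assumptions, while the host names, URL, and expected text come from the steps above.

```python
import socket
import urllib.request

PROBE_HOST = "www.msftconnecttest.com"
PROBE_URL = "http://www.msftconnecttest.com/connecttest.txt"
PROBE_TEXT = "Microsoft Connect Test"
DNS_PROBE_HOST = "dns.msftncsi.com"


def resolves(host):
    """Return True if the name resolves (steps 1 and 4)."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def fetch_probe_body(url=PROBE_URL, timeout=5.0):
    """Return the body of the probe file, or None on failure (step 2)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("ascii", errors="replace")
    except OSError:  # URLError, HTTPError and timeouts are OSError subclasses
        return None


def probe_passed(dns_ok, body, dns_probe_ok):
    """Combine the individual results the way the steps describe."""
    if not dns_ok or body is None:
        return False
    if PROBE_TEXT not in body:  # step 3: content check
        return False
    return bool(dns_probe_ok)


def run_active_probe():
    """Run the steps in order; this performs real network requests."""
    dns_ok = resolves(PROBE_HOST)
    body = fetch_probe_body() if dns_ok else None
    dns_probe_ok = resolves(DNS_PROBE_HOST) if body is not None else False
    return probe_passed(dns_ok, body, dns_probe_ok)
```

Calling `run_active_probe()` on an affected laptop and on a healthy one gives a quick first comparison; in the case described in this post the probe itself succeeded, so a `True` result here does not rule the problem out.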

How does the NCSI “passive probe” process work?

This process is independent of the one described above and by default runs every 15 seconds. It applies various tests and algorithms based on the user’s current activity on the computer. Live network traffic is captured and analyzed passively, i.e., without sending any packets. The TTL (Time To Live) in the IP headers of received packets is checked to estimate how many “hops” a packet took to reach the computer. A packet that has travelled more than 8 hops is taken as a sign that the internet is reachable.
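To make the hop-count idea concrete, here is a small sketch in Python. It assumes the common heuristic that senders start with a TTL of 64, 128, or 255, so the hop count is the distance from the observed TTL to the nearest such value above it. Whether NCSI estimates hops exactly this way is my assumption; only the threshold of 8 hops comes from the description above, and the function names are illustrative.

```python
COMMON_INITIAL_TTLS = (64, 128, 255)  # assumed typical starting values
HOP_THRESHOLD = 8                     # "more than 8 hops" from the text


def estimate_hops(observed_ttl):
    """Estimate how many routers decremented the TTL of a received packet."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    # Pick the smallest common initial TTL that is not below the observed one.
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl


def suggests_internet(observed_ttl, threshold=HOP_THRESHOLD):
    """True if the packet appears to have travelled more than `threshold` hops."""
    return estimate_hops(observed_ttl) > threshold
```

For example, a packet arriving with TTL 117 is assumed to have started at 128 and crossed 11 hops, which is above the threshold, while a packet with TTL 127 crossed a single hop and most likely came from the local network.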

What does Zscaler have to do with all of this?

The answer to the above question was not straightforward, but with the support of Microsoft, we managed to isolate and solve the problem.

Computers that reported the issue had the Zscaler client installed. It turned out that the “Zscaler tunnel” service modifies the network interface (wireless) by removing the default broadcast route on the interface (not in the routing table!). Windows detects the change and activates the NLAsvc service.

The logs below come from the diagnostic scripts we ran and from Microsoft’s analysis of the problem.

[Microsoft-Windows-TCPIP/Diagnostic] IP: Interface property change. Interface = 15, Compartment = 1, Protocol = IPV4, Advertise = FALSE, AdvertiseDefaultRoute = FALSE, Forward = FALSE, ForwardMulticast = FALSE, UseNud = TRUE, AdvertisingEnabled = FALSE, WeakHostSend = FALSE, WeakHostReceive = FALSE

Changing the interface properties triggers the NCSI active probe process described earlier, which completes successfully.

[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: =====Generate Headers=====================
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: Request Message Generated (DataChunk 0x204B9CE8460[0x8D])
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: =====Send Headers=========================
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: Sending Headers: GET http://www.msftconnecttest.com/connecttest.txt HTTP/1.1
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: =====Receive Headers======================
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60: HTTP Parser (Connection 0x204B9CF3F40) (Buffer: 0x204B9C5DF50 [0x0/0x21B/0xDB5]) (ParserChunk 0x0 [0x0]) 0x204B9225C60
[Microsoft-Windows-WebIO/Diagnostic] 0x204B9225C60 Received Headers: HTTP/1.1 200 OK

Unfortunately, due to a bug in the system, the connection is nevertheless classified as being in a “Bad state”, even though the “active probe” process was successful.

[wcmserver] routemanagerbadconnectionstate_cpp389 RouteManagement::BadConnectionState::BadStateTimerCallback() – RouteManagement::BadConnectionState::BadStateTimerCallback – the interface b290987c-a060-47a6-af10-44536b270ee0 (wcm_media_wlan) is ConnectionQualityState::BadConnectivity for 8156 ms (triggers 5)
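When a trace contains many such entries, it helps to extract the interesting fields automatically. The sketch below, in Python, parses a line shaped like the one above; the regular expression is derived from this single sample only, so treat the format as an assumption, and the class and function names are my own.

```python
import re
from typing import NamedTuple, Optional


class BadStateEvent(NamedTuple):
    interface_guid: str
    media: str
    state: str
    duration_ms: int
    triggers: int


# Pattern based on the one sample line shown above; other builds may differ.
BAD_STATE_RE = re.compile(
    r"the interface (?P<guid>[0-9a-fA-F-]{36}) \((?P<media>\w+)\) is "
    r"ConnectionQualityState::(?P<state>\w+) for (?P<ms>\d+) ms "
    r"\(triggers (?P<triggers>\d+)\)"
)


def parse_bad_state(line: str) -> Optional[BadStateEvent]:
    """Return the parsed event, or None if the line does not match."""
    match = BAD_STATE_RE.search(line)
    if match is None:
        return None
    return BadStateEvent(
        interface_guid=match.group("guid"),
        media=match.group("media"),
        state=match.group("state"),
        duration_ms=int(match.group("ms")),
        triggers=int(match.group("triggers")),
    )
```

Running `parse_bad_state` over every line of a trace and keeping the non-`None` results shows how long each interface stayed in the bad state and how many triggers were counted.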

This classification is performed by RnR (Reset and Recovery), a part of the Windows Connection Manager (WCM) that was introduced in Windows 10 version 1809. Its task is first to classify the connection as being in a “Bad State” under certain conditions, and then to attempt to fix the problem.

RnR, as a repair mechanism, first tries to force the client to change the access point (roaming) it connects to, by disconnecting and reconnecting to the wireless network. The next step is to reset the wireless card.

If we experience a similar problem and cannot work with Microsoft on it, we can generate a “wlan report” in Windows and use it to confirm whether we are dealing with the same situation.

To do this, launch Command Prompt as an administrator: press Start, type “cmd” in the search field, right-click the “Command Prompt” result, and choose “Run as administrator”.

At the prompt, type the following command and press Enter:

netsh wlan show wlanreport

Windows generates the report and, by default, stores it in the following location:

C:\ProgramData\Microsoft\Windows\WlanReport\wlan-report-latest.html
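If the report has to be collected from many machines, the same step can be scripted. This is a minimal sketch in Python that runs the command shown above and returns the default report path; the function names are illustrative, and reading the `ProgramData` environment variable with a fallback to `C:\ProgramData` is my assumption about how to locate the folder.

```python
import os
import subprocess
from pathlib import PureWindowsPath


def wlan_report_path():
    """Default location of the latest WLAN report."""
    program_data = os.environ.get("ProgramData", r"C:\ProgramData")
    return PureWindowsPath(
        program_data, "Microsoft", "Windows", "WlanReport",
        "wlan-report-latest.html",
    )


def generate_wlan_report():
    """Run netsh (requires an elevated session) and return the report path."""
    subprocess.run(["netsh", "wlan", "show", "wlanreport"], check=True)
    return wlan_report_path()
```

`generate_wlan_report()` only works on Windows from an administrator session; `check=True` makes the script fail loudly if netsh returns an error.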

We are looking for our connection, which should be the last one in the report. Let’s analyze it from the perspective of this example.

The test began at 08:51:38 when connecting to the wireless network. Around 8:52:18 the first disconnection occurs.

This is how it looks from the perspective of a client connected to the wireless network.

The connection is re-established at 8:52:29, but a few moments later we are disconnected again. This time, the network card is reset.

This time, this is how it looks from the client’s perspective.

Unfortunately, the disconnection from the wireless network and the network card reset occur once again, at around 8:53 and 8:54.

Ultimately, after about 5 minutes, the connection is stable. Time for a solution.

Solution

The quickest solution is to migrate to Windows 11. I know. In corporations, it’s not that simple. You might ask why Windows 11 doesn’t experience such problems. Well, Microsoft fixed the bug in this version of the software, but unfortunately, Windows 10 hasn’t received the update. In fact, as of now, it’s unknown when or if it will happen at all.

There is also a fairly simple workaround, which I have tested on a large scale and which Microsoft has confirmed.

The solution is a registry entry that disables the functionality of RnR (Reset and Recovery – a part of the Windows Connection Manager – WCM), which was introduced with the Windows 10 1809 version. It doesn’t affect the operation of other services and is fully transparent to other Microsoft services.

Key: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WcmSvc
Value name: EnableBadStateTracking
Type: REG_DWORD
Value: 0 (disabled)
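For a rollout to more than a handful of machines, the entry can be written with the built-in `reg add` command instead of editing the registry by hand. Below is a minimal sketch in Python that builds and runs that command; the function names are my own, and the script must run elevated because it writes under HKEY_LOCAL_MACHINE.

```python
import subprocess

KEY = r"HKLM\SOFTWARE\Microsoft\WcmSvc"
VALUE_NAME = "EnableBadStateTracking"


def disable_command():
    """Build the reg.exe command that sets the DWORD value to 0."""
    return [
        "reg", "add", KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", "0",
        "/f",  # overwrite an existing value without prompting
    ]


def query_command():
    """Build the reg.exe command that shows the current value."""
    return ["reg", "query", KEY, "/v", VALUE_NAME]


def disable_bad_state_tracking():
    """Apply the workaround on the local machine (Windows, elevated)."""
    subprocess.run(disable_command(), check=True)
```

After calling `disable_bad_state_tracking()`, running the command returned by `query_command()` should list the value as `0x0`. In a managed environment the same key and value would more typically be distributed through Group Policy or a device management tool.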

Summary

With the release of Windows 10 version 1809, the RnR (Reset and Recovery) functionality was introduced, which was designed to respond to internet access issues. Unfortunately, when combined with the Zscaler service, a problem arose with the correct operation of the aforementioned function. The result is that clients are disconnected, multiple times, until the connection stabilizes. Disabling the feature in the registry resolves the issue.
