rpcWiFi problem when WiFi is lost

Hello !!

I have made a medium-sized project using WiFi on the Wio Terminal.

I have a problem when the WiFi connection is lost: all my tasks seem to hang…
After some investigation, I can see something going wrong when udp.begin(xx) is called while the WiFi is being lost.

Here is my code (a slightly modified example that queries an NTP server):

static time_t _getNTPtime(void)
{
  // Returns the time as an unsigned long: seconds since Jan 1, 1970
  // (Unix time), or 0 if a problem was encountered.
  WiFiUDP udp;
  unsigned long epoch = 0;
  const int NTP_PACKET_SIZE = 48; // NTP time stamp is in the first 48 bytes of the message
  byte packetBuffer[NTP_PACKET_SIZE]; // buffer to hold incoming and outgoing packets
  //initializes the UDP state
  //This initializes the transfer buffer
  Serial.println("Sync time with NTP");
  udp.begin(WiFi.localIP(), 2390);
  _sendNTPpacket(&udp, timeServer, packetBuffer, NTP_PACKET_SIZE); // send an NTP packet to a time server
  // wait to see if a reply is available
  delay(1000);
  if (udp.parsePacket())
  {
    Serial.println("udp packet received");
    // We've received a packet, read the data from it
    if (udp.read(packetBuffer, NTP_PACKET_SIZE)==NTP_PACKET_SIZE) // read the packet into the buffer
    {
      //the timestamp starts at byte 40 of the received packet and is four bytes,
      // or two words, long. First, extract the two words:
      unsigned long highWord = word(packetBuffer[40], packetBuffer[41]);
      unsigned long lowWord = word(packetBuffer[42], packetBuffer[43]);
      // combine the four bytes (two words) into a long integer
      // this is NTP time (seconds since Jan 1 1900):
      unsigned long secsSince1900 = highWord << 16 | lowWord;
      // Unix time starts on Jan 1 1970. In seconds, that's 2208988800:
      const unsigned long seventyYears = 2208988800UL;
      // subtract seventy years:
      epoch = secsSince1900 - seventyYears;
    }
    else
    {
      Serial.println("udp packet not complete");
    }
  }
  // not calling ntp time frequently, stop releases resources
  udp.stop();
  return epoch ;
}
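For anyone checking the timestamp arithmetic in the function above in isolation, here is a minimal sketch of the same conversion as a pure function. The name ntpPacketToUnix is hypothetical (it is not part of rpcWiFi or the NTP example), and it uses fixed-width types so the result does not depend on the size of unsigned long.

```cpp
#include <cstddef>
#include <cstdint>

// Seconds between the NTP epoch (Jan 1 1900) and the Unix epoch (Jan 1 1970).
constexpr uint32_t kSeventyYears = 2208988800UL;

// Hypothetical helper, not part of rpcWiFi: reads the seconds field of the
// transmit timestamp (bytes 40..43, big-endian) from a 48-byte NTP reply
// and converts it to Unix time.
// Returns 0 if the packet is too short or the timestamp predates 1970.
uint32_t ntpPacketToUnix(const uint8_t *packet, size_t len)
{
    if (packet == nullptr || len < 48) {
        return 0;
    }
    const uint32_t secsSince1900 =
        (static_cast<uint32_t>(packet[40]) << 24) |
        (static_cast<uint32_t>(packet[41]) << 16) |
        (static_cast<uint32_t>(packet[42]) << 8) |
        static_cast<uint32_t>(packet[43]);
    if (secsSince1900 < kSeventyYears) {
        return 0; // unset or invalid timestamp
    }
    return secsSince1900 - kSeventyYears;
}
```

In the sketch above it would be called as ntpPacketToUnix(packetBuffer, NTP_PACKET_SIZE) after udp.read() succeeds. It does nothing about the hang itself; it only separates the parsing from the network calls.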

I have enabled verbose mode for rpcWiFi and added some more traces inside the udp.begin function…

Sync time with NTP
[E][WiFiUdp.cpp:42] begin(): before stop
[E][WiFiUdp.cpp:44] begin(): after stop
[D][WiFiGeneric.cpp:383] _eventCallback(): Event: 5 - STA_DISCONNECTED
[W][WiFiGeneric.cpp:407] _eventCallback(): Reason: 0 - MAX
Disconnected from WIFI access point
WiFi lost connection. Reason: 0

As you can see, the WiFi disconnected event is raised after I start dealing with UDP,
and the task calling udp.begin is blocked after the first instruction (stop()).
The call that is probably waiting forever seems to be socket().

uint8_t WiFiUDP::begin(IPAddress address, uint16_t port){
  log_e("before stop");
  stop();
  log_e("after stop");
  server_port = port;
  tx_buffer = new char[1460];
  if(!tx_buffer){
    log_e("could not create tx buffer: %d", errno);
    return 0;
  }
  if ((udp_server=socket(AF_INET, SOCK_DGRAM, 0)) == -1){
    log_e("could not create socket: %d", errno);
    return 0;
  }
  log_e("after socket");
  int yes = 1;
  if (setsockopt(udp_server,SOL_SOCKET,SO_REUSEADDR,&yes,sizeof(yes)) < 0) {
    log_e("could not set socket option: %d", errno);
    stop();
    return 0;
  }
  log_e("after setsockopt");
  struct sockaddr_in addr;
  memset((char *) &addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(server_port);
  addr.sin_addr.s_addr = (in_addr_t)address;
  if(bind(udp_server , (struct sockaddr*)&addr, sizeof(addr)) == -1){
    log_e("could not bind socket: %d", errno);
    stop();
    return 0;
  }
  log_e("after bind");
  fcntl(udp_server, F_SETFL, O_NONBLOCK);
  return 1;
}

It seems WiFi.status() can also block in this case.
Below, one task uses HTTPClient for a request to a server (and never returns).
Another task tries to start an NTP sync, but WiFi.status() never returns either.
All tasks using rpcWiFi are now blocked forever…

Read measurements
[V][HTTPClient.cpp:236] beginInternal(): url: http://someip/somerequest.php
[D][HTTPClient.cpp:277] beginInternal(): host: someip port: 80 url: /somerequest.php
[D][HTTPClient.cpp:563] sendRequest(): request type: 'POST' redirCount: 0
[D][WiFiGeneric.cpp:383] _eventCallback(): Event: 5 - STA_DISCONNECTED
[W][WiFiGeneric.cpp:407] _eventCallback(): Reason: 0 - MAX
Disconnected from WIFI access point
WiFi lost connection. Reason: 0
Sync time from NTP

The associated code:

      Serial.println("Sync time from NTP");
      if (WiFi.status() == WL_CONNECTED)
      {
        time_t ntpdt=_getNTPtime();

The _getNTPtime() function is the one at the beginning of this post…

It seems there is some kind of deadlock somewhere in rpcWiFi…
Can someone help?

Oh, I forgot: I’m using FreeRTOS and PlatformIO; memory and stack seem to be large enough…

Eric.

I have more clues:
everything stops in erpc_transport_arbitrator.cpp, in the function TransportArbitrator::clientReceive.

For some reason, at a certain point, the semaphore is never set…
and from my investigation, I have at least 3 “clients” waiting for something…

The function TransportArbitrator::receive is not called to receive any data, so the semaphore is never set…

But the big question is… why???
Is there a bug in the RTL8720 firmware?

Hello Seeed guys…

I have continued my investigation…
Now I’m pretty sure the RTL8720 firmware gets stuck somewhere when the WiFi is lost (AP simply turned off; it fails 100% of the time).

I have added lots of traces in the eRPC lib, and every run ends with a sent message like:

00091001c7060000000000001000000000020050c0a809f2000000000000000010000000

and the lib never receives a response… so in the end all clients are stuck waiting on the semaphore, and all my tasks are stuck waiting on a client.

It is strange that there is no timeout on the message exchange with the RTL chip… although even with one, we would also need a way to restart the RTL chip…
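To illustrate what a timeout on that wait could look like, here is a minimal sketch in standard C++. It is not the eRPC code: ResponseSignal is a hypothetical stand-in for the per-client semaphore, with give() playing the role of the receive side and take() the role of the blocked client.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the per-client semaphore; not the eRPC class.
class ResponseSignal {
public:
    // Called by the receive side when a reply for this client arrives.
    void give()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_ready = true;
        }
        m_cond.notify_one();
    }

    // Waits for a reply, but gives up after `timeout`.
    // Returns true if a reply arrived, false on timeout.
    bool take(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (!m_cond.wait_for(lock, timeout, [this] { return m_ready; })) {
            return false; // nothing came back in time
        }
        m_ready = false; // consume the signal
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_ready = false;
};
```

On FreeRTOS the same idea would be a finite tick count passed to xSemaphoreTake() instead of an infinite wait. A false return only tells the caller that the chip did not answer; the caller still has to decide what to do next (report an error, restart the RTL chip, and so on).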

Is there a way to debug the RTL chip?

Hi @Eric_Bouxirot,
there have been some posts in the past where people had the feeling that the WiFi communication sometimes hangs
(Wio Terminal App (rpcWiFi lib) sending telemetry data to Azure Storage Tables stops working after many uploads).
Unfortunately it seems that Seeed currently isn’t investing much effort in this matter. I hope that they come back to work on this soon, and I would be happy if you could find the reason for this issue.
In my applications I could use the watchdog to recover from the unresponsive states.

Hi @RoSchmi ,

Thanks for your answer… I just read your post… and many others…
You have run several tests… like me…
For me the watchdog is not a good workaround…

I also checked for memory leaks, stack overflows, etc.; everything looks fine, with a huge margin…
IMHO, it’s pretty clear the RTL8720 is at fault (sometimes it stops answering at all on the UART transport lower layer), and from that point on the SAMD code gets stuck forever on each request to the RTL (in this case, my supervision task still runs and shows nothing special).

I’m a little bit afraid Seeed is no longer working on this product/failure…
@lakshan, do you have any news on all these rpc issues?

As mentioned in my previous email, I can help find the problem on the RTL side, but I need information on how to do this…
It seems the RTL code is a kind of black box, at least in some parts, and because access to it is indirect, I think it is a pain to debug…
But I can try… with help…

Eric.

I share @Eric_Bouxirot’s concern that maybe Seeed has decided that the WiFi isn’t something they can get working at this time, or that this forum isn’t “the place to be” these days. Worrying… but I guess the Wio Terminal is still great fun for a great price… for other things.

I wonder if the problems (and my tiny ideas for problem identification) that I reported at…

… might be relevant to the discussion in this thread?

I gather that those of you here do SOMETIMES get connected to your APs? Then they fail, and your system is blocked while they TRY to reconnect… and then (often? always?) they never do succeed? How long are they blocked… along the lines of my 20 seconds?

I’m not sure I ever achieved a first connection!

Even if I haven’t supplied the right password… the easy, and not impossible “answer”… I wouldn’t expect my system to hang for 20 seconds while it TRIED, and I would expect a more elegant response to the problem, and more transparency, when “the system” gives up TRYING to connect.

I’m now following this discussion, too… hoping to see your problems solved. Any comments on my code, my interpretations reported in my thread welcome… here or there. Even a “we saw this, we think it does/ does not relate to our questions” post at my thread welcome.

Tom

Hi everyone,

I implemented a supervisor task that resets the device (each reset is logged) every time the eRPC lib gets stuck.
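For readers who want to try the same approach, here is a minimal sketch of how such a supervisor could detect a stuck call. It is not my actual code: StallDetector and its method names are hypothetical, and the timestamps are assumed to come from millis().

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stall detector; names are illustrative only.
class StallDetector {
public:
    explicit StallDetector(uint32_t timeoutMs) : m_timeoutMs(timeoutMs) {}

    // Worker task calls this just before entering a WiFi/eRPC call.
    void enter(uint32_t nowMs)
    {
        m_sinceMs.store(nowMs); // publish the start time before the busy flag
        m_busy.store(true);
    }

    // Worker task calls this when the call has returned.
    void leave() { m_busy.store(false); }

    // Supervisor task polls this; true means a call has been pending too long.
    bool isStuck(uint32_t nowMs) const
    {
        // Unsigned subtraction stays correct across a millis() wrap-around.
        return m_busy.load() && (nowMs - m_sinceMs.load()) > m_timeoutMs;
    }

private:
    const uint32_t m_timeoutMs;
    std::atomic<bool> m_busy{false};
    std::atomic<uint32_t> m_sinceMs{0};
};
```

The supervisor task would poll isStuck(millis()) periodically and, when it returns true, log the event and reset the device (for example with the CMSIS NVIC_SystemReset() call, which is an assumption about the target and not something shown in this thread).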

I’m currently testing adding mutexes to my app to avoid concurrent calls to the eRPC library.
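The idea can be sketched like this, in standard C++ for brevity. The helper names rpcMutex and withRpcLock are hypothetical; on FreeRTOS the equivalent would be a mutex from xSemaphoreCreateMutex() taken and given around each call.

```cpp
#include <mutex>
#include <utility>

// Hypothetical global lock shared by every task that touches the WiFi stack.
inline std::mutex &rpcMutex()
{
    static std::mutex m;
    return m;
}

// Runs `call` while holding the lock, so only one task at a time
// is inside the RPC library. The lock is released when `call` returns.
template <typename F>
auto withRpcLock(F &&call) -> decltype(call())
{
    std::lock_guard<std::mutex> lock(rpcMutex());
    return std::forward<F>(call)();
}
```

A task would then wrap each library call, for example withRpcLock([] { return WiFi.status(); }). This only serializes access; if a call never returns while holding the lock, the other tasks will wait on the lock instead, so the supervisor is still needed.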

As I write this, my app has been running fine for 12 hours…

Wait & see…