Best way to transmit a .jpg from ESP32-S3 Sense Camera to Jetson Nano?

Hello, All.

I’m working on a small project and am looking for suggestions.

I built a small module that uses the ESP32-S3 Sense with a camera module to take an image when a button is pressed and save it to an SD card (fairly straightforward, and this works fine). I also took a Jetson Orin Nano and installed a convolutional neural network (YOLO26s). I then wrote a simple Python script that takes an image file as a command-line argument and prints out the object classification(s) (this also works fine).

What I’d like to do is set up the ESP32-S3 to send its image to the Jetson for image classification.

I initially thought I would set up a simple MQTT service, where the ESP32 would publish the image to a topic that the Jetson is subscribed to. However, even with compression, I’m not sure this would be a good idea in practice.

I THINK that the only real solution is to send the image file over Wi-Fi to the Jetson (they’ll be on the same network). I’m just not sure the best way to do it. Any recommendations?

I imagine that I could just have a python script running on the Jetson in an endless while-loop that scans for the IP address of the ESP32-S3, and performs its analysis on new images as they’re posted. Is this a good idea? If so, does it make more sense to set up the ESP32-S3 as the server or the Jetson? I haven’t done much Wi-Fi stuff, so this is a bit new to me and would appreciate any and all advice.

Thank you,

Joe

SMALL UPDATE

I’ve written a fairly simple Python script on the Jetson that pulls images from a specified URL.

I suppose it would make the most sense at this point to make the ESP32-S3 a simple web server that posts its images as they’re taken. The Jetson will poll it and perform classification when it detects that the image on the web server has changed. I’m not sure of the best way to do that, actually :thinking:

Any advice would be helpful.

Thank you,

Joe
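One possible way to detect "the image has changed" while polling is to hash each download and compare it with the previous hash. The sketch below uses only the Python standard library. The URL, the one-second interval, and the names `fetch_image`, `ChangeDetector`, and `poll_forever` are made up for illustration, and it assumes the ESP32 serves the most recent button-press image (if the endpoint captures a fresh frame on every request, the hash will differ every time and this check won't help).

```python
import hashlib
import time
import urllib.request


def fetch_image(url, timeout=5.0):
    """Download the current image bytes from the ESP32's web server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


class ChangeDetector:
    """Remembers the digest of the last image so repeats can be skipped."""

    def __init__(self):
        self._last_digest = None

    def is_new(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        return True


def poll_forever(url, on_new_image, interval_s=1.0):
    """Call on_new_image(bytes) each time the downloaded image changes."""
    detector = ChangeDetector()
    while True:
        try:
            data = fetch_image(url)
            if data and detector.is_new(data):
                on_new_image(data)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"fetch failed: {exc}")
        time.sleep(interval_s)


# Example call (hypothetical address and path):
# poll_forever("http://192.168.1.50/capture", lambda d: print(len(d), "bytes"))
```

The `on_new_image` callback is where the existing classification script would be hooked in, for example by writing the bytes to a file and passing the path along.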

I think you can use the Jetson as the server and the ESP32 as the client. The Jetson is vastly more capable than the ESP32. Let the Jetson expose an HTTP endpoint, and have the ESP32 push images to it whenever a button is pressed.

That sounds pretty neat. How might I go about doing that? Google says to use a Flask server, which I’ve never heard of until now.
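Flask would work, but as a rough sketch the same idea can be tried with nothing beyond Python's standard library. Everything specific here is an assumption for illustration: the `/upload` path, port 8000, the temp-folder location, and `handle_image`, which is only a placeholder for the existing classification script. It expects the ESP32 to POST the raw JPEG bytes as the request body.

```python
import tempfile
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

# Hypothetical folder for received images.
UPLOAD_DIR = Path(tempfile.gettempdir()) / "esp32_uploads"


def handle_image(data: bytes) -> str:
    """Placeholder: call the real classifier here and return its text output."""
    return f"received {len(data)} bytes"


class UploadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/upload":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        if not data.startswith(b"\xff\xd8"):  # every JPEG starts with FF D8
            self.send_error(400, "not a JPEG")
            return
        # Save to disk so a script that takes a file path can be reused.
        UPLOAD_DIR.mkdir(exist_ok=True)
        (UPLOAD_DIR / f"{int(time.time() * 1000)}.jpg").write_bytes(data)
        reply = handle_image(data).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def make_server(host="0.0.0.0", port=8000):
    """Build the server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), UploadHandler)


# To run on the Jetson:
# make_server().serve_forever()
```

Before touching the ESP32, this can be checked from any PC on the network with something like `curl --data-binary @photo.jpg http://<jetson-ip>:8000/upload`.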

Hi there,

So two things I would offer: use the already built and working web server and poll it with the Jetson (kinda old school), or go a little more modern and use WebSockets :+1:
:grin:

Using the stock camera web server to stream or post images to the Jetson is a fantastic, highly practical way to get started. It relies on proven, stable code, making it an excellent baseline.

Option 1: The Stock Web Server (The Reliable Baseline)

In this setup, the ESP32-S3 acts as an HTTP server hosting a stream (like the default app_httpd.cpp examples), and your Jetson script periodically grabs frames via HTTP GET requests.

  • Why it’s good: It is very easy to debug. You can open a web browser on your PC, type in the ESP32’s IP address, and instantly see if the camera is working independent of your Jetson code.
  • Why it’s a bit “uncool”: Polling. Having the Jetson constantly hit a URL in an endless loop to check if an image is “new” wastes processing cycles and introduces latency. If the Jetson asks for an image before the ESP32 has fully written a new frame, you can get torn or corrupted images.
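If truncated downloads are a worry, one cheap guard on the Jetson side is to check the JPEG start and end markers before handing the bytes to YOLO. This is only a sanity check sketched under the assumption that the ESP32 sends a plain JPEG with no trailing padding; the function name is made up, and passing the check does not prove the image decodes correctly.

```python
JPEG_SOI = b"\xff\xd8"  # Start Of Image marker
JPEG_EOI = b"\xff\xd9"  # End Of Image marker


def looks_complete_jpeg(data: bytes) -> bool:
    """Return True if the bytes begin and end like a whole JPEG file."""
    return (
        len(data) >= 4
        and data.startswith(JPEG_SOI)
        and data.endswith(JPEG_EOI)
    )
```

A polling loop could simply skip any download where this returns False and try again on the next pass.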

The “Cooler & Faster” Way: WebSockets or Raw TCP Sockets

Instead of the Jetson asking for images (Pull), change the architecture so the ESP32 instantly pushes the image the millisecond it takes it (Push).

  • How it works: You run a lightweight WebSocket or TCP server on the Jetson Nano. When the ESP32-S3 boots up, it connects to the Jetson. The moment you press the physical button on your module, the ESP32 grabs the frame buffer and dumps the raw binary .jpg straight down the open socket connection.
  • Why it’s better: Very low latency. Once the connection is open, there is no per-image HTTP overhead (headers, handshakes). The Jetson just sits silently, blocking on the socket, and wakes up as soon as data arrives to feed it right into your YOLO network.
  • The “Cool” Factor: You can build a real-time pipeline. Instead of just one snapshot when a button is pressed, a WebSocket is fast enough to stream lower-resolution video frames into YOLO for live object tracking.
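For the raw TCP variant, the receiver needs some way to know where one image ends, because TCP is a byte stream with no message boundaries. A minimal sketch of the Jetson side is below, assuming a framing where the ESP32 sends a 4-byte big-endian length followed by the JPEG bytes. That framing, port 5000, and all function names are assumptions for illustration, not something the stock ESP32 examples do.

```python
import socket
import struct

# Assumed framing: 4-byte big-endian length, then that many payload bytes.
HEADER = struct.Struct("!I")


def recv_exact(conn, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(conn):
    """Receive one length-prefixed message and return its payload."""
    (length,) = HEADER.unpack(recv_exact(conn, HEADER.size))
    return recv_exact(conn, length)


def send_frame(conn, payload):
    """Sender side of the same framing (handy for testing from a PC)."""
    conn.sendall(HEADER.pack(len(payload)) + payload)


def serve(on_image, host="0.0.0.0", port=5000):
    """Accept one client at a time and pass each received image to on_image."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _addr = srv.accept()
            with conn:
                try:
                    while True:
                        on_image(recv_frame(conn))
                except ConnectionError:
                    pass  # client went away; wait for the next connection


# To run on the Jetson:
# serve(lambda jpeg: print(len(jpeg), "bytes received"))
```

The blocking `recv` calls are what let the Jetson sit idle until the ESP32 actually sends something, so no polling loop is needed.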

You already know how I do it…

My Recommendation

  1. Phase 1 (Get it working): Use the Stock Web Server approach just to prove your Jetson’s Python script can successfully grab a matrix, pass it to YOLO, and print a valid detection.
  2. Phase 2 (Make it cool): Flip the script. Turn the Jetson into a Python TCP/WebSocket listener, and make the ESP32 the client that blasts the image data over as soon as the button is pressed. It will feel instantaneous, reduce your code complexity on the Jetson, and give you that snappy, low-latency performance that makes hardware projects feel incredibly rewarding!
    HTH
    GL :slight_smile: PJ v:

So it turns out there is a door number 3. AI shares this:

The “Engineering Flex” Way: MQTT with Edge-Impulse or ROS2

If you want this project to look like a professional robotics or IoT system, you use a dedicated messaging protocol.

  • How it works: You run an MQTT broker (like Mosquitto) on the Jetson Nano. The ESP32 publishes the binary payload of the JPEG to a topic like camera/images.
  • Addressing your concern: You mentioned being worried about MQTT handling image data. While MQTT isn’t designed for high-framerate 4K video streams, it is perfectly capable of handling individual compressed .jpg files (up to 256MB per message by spec, though keeping them under 100KB is ideal for microcontrollers).
  • Why it’s better: It makes your system completely modular. You could add a second ESP32 camera tomorrow, publish to the same topic, and the Jetson would automatically process images from both without changing a single line of your Jetson’s networking code.
  • The “Cool” Factor: You can subscribe to a secondary topic like jetson/predictions. Once the Jetson classifies the image, it can publish the result (“Object: Person, Confidence: 95%”) back to the ESP32, which could flash an onboard LED or show the text on a tiny OLED screen.
  • :grin: :backhand_index_pointing_right: :robot:
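A rough sketch of what the Jetson side of the MQTT option might look like is below. It assumes the third-party paho-mqtt package with its 2.x callback API, a Mosquitto broker already running on the Jetson, and the two topic names mentioned above. `classify` is only a placeholder for the real YOLO call, and the JSON result format is made up for illustration.

```python
import json

IMAGE_TOPIC = "camera/images"        # topic the ESP32 publishes JPEGs to
RESULT_TOPIC = "jetson/predictions"  # topic the Jetson publishes results to


def classify(jpeg_bytes):
    """Placeholder: run the real YOLO model here."""
    return [{"label": "unknown", "confidence": 0.0}]


def build_result(jpeg_bytes):
    """Package the detections as a small JSON string."""
    return json.dumps(
        {"bytes": len(jpeg_bytes), "detections": classify(jpeg_bytes)}
    )


def on_connect(client, userdata, flags, reason_code, properties):
    # Subscribing here means the subscription is restored after a reconnect.
    client.subscribe(IMAGE_TOPIC)


def on_message(client, userdata, msg):
    # msg.payload holds the raw JPEG bytes published by the ESP32.
    client.publish(RESULT_TOPIC, build_result(msg.payload))


def main(broker_host="localhost", port=1883):
    # Imported here so the helpers above can be used without paho installed.
    import paho.mqtt.client as mqtt  # pip install paho-mqtt (2.x assumed)

    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect(broker_host, port)
    client.loop_forever()


# To run on the Jetson:
# main()
```

Because every camera publishes to the same topic, a second ESP32 would need no changes on the Jetson side, which is the modularity point made above.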