Kurento Media Server: Everything You Need To Know In 2021

A short review on Kurento in simple language for business people and those who’re not into all that techy stuff.

Imagine that a programmer approached you and said that he needs a media server for development, and he recommends that you use Kurento. How do you know whether it’s the best choice? There’s a lot of information but digesting it might be difficult as it’s all deeply technical.

We’ll try our best to provide you with enough information to make a decision whether or not you should use Kurento in your video app. We’ll tell you why the media server is important, about the license, architecture, main functions that you can develop with Kurento, those being modules. We’ll finish with the summary of when it’s best to use a low-level media server, such as Kurento, and when it’s best to use an out-of-the-box solution.

WebRTC and why we need a media server

Kurento is an open-source WebRTC media streaming server with many built-in video conferencing modules released under the Apache license. WebRTC is a standardized, low latency, real-time, browser-to-browser transmission method without the need for third-party plugins or extensions. WebRTC is a fully client-side technology, so why would we need a media server?

The main reason is the load on the client with a large number of participants. The number of connections between participants grows exponentially, at the same time video quality worsens and the load on traffic and system resources grows. WebRTC can be used as normal P2P communication between 2-6 (in our experience), but if there are more participants, it makes more sense to use a media server. In addition, there are difficulties if we need to save a video recording to a separate file or somehow process it on the fly because all the work lies on the client.

A short history of Kurento

Kurento was developed in 2010 in Madrid as a separate open-source project. The main language that Kurento uses is C++, which helps to optimize the resources of the system.

The media server has more than 2500 stars on GitHub and a few hundred forks, which are separate branches of the project supported by the community.

At the moment Kurento development team joined Twilio and there are minor versions of Kurento itself with some minor patches, and new versions are still being released.

Apache, the license of Kurento

Kurento is released under the Apache license. It gives the developers absolute freedom when working with a code. You just have to mention what changes you apply and who the first author is.

Products with software under the Apache license can be used in commercial products. You can use Kurento in your products for free and get income from those products. No need to pay any loyalty.

You can find the full text about Apache here.

Kurento architecture: MCU and SFU

There are two main types of media server architectures: MCU and SFU.

MCU (Multipoint Conferencing Unit) is a collage-like video architecture. We have multiple streams of users which make up one big seamless picture with each item’s location unchanged. The MCU takes an outgoing video stream from each participant, then the media server stitches all the streams into one with a fixed layout. Thus, despite the fact that there are many participants in the conference, each client receives only one stream as input. This allows to save CPU resources and traffic consumption on the client-side, but increases the load on the server itself and limits the possibilities of video chat layout customization. With MCU we can’t tell exactly which participant is on the video. Mixing streams requires a lot of processing power, increasing the cost of maintaining the server. This type of architecture is mostly suitable for meetings with a large number of participants (> 40). MCU is also a good solution if you need to get video streams on weak devices e.g. on the phone thanks to processing on the server.

SFU (Selective Forwarding Unit) is a popular architecture in modern WebRTC solutions that allows the video conference client to receive only the video streams it needs at the moment. SFU is more like a mosaic where you have to assemble the elements yourself, but the order of assembly is up to you. In SFU each participant also sends his stream to the server, but the other stream comes separately. This architecture better distributes the load between server and client and gives full control over the implementation of the video chat interface. Unlike MCU, a server with SFU doesn’t have to decode and transcode incoming streams. This helps to significantly reduce the load on the server CPU. SFU is well suited for broadcasting (one-to-many or one-person streaming) due to its ability to dynamically scale the system depending on the number of streams. At the same time, this type of server requires more outgoing server bandwidth because it has to do more streams to clients.

Let’s compare these 2 types using the 4-people video conference as an example.

On the client:

Differences between SFU and MCU for the client

On the server:

Differences between SFU and MCU for the server

There are also systems that use hybrid architectures to achieve the best result depending on the current number of users and the needs of a particular client. For example, if the client is a weak mobile device, it can receive a single stream from the media server as in the MCU. Browser users, on the other hand, will receive streams separately, where it is possible to implement a unique way of meshing elements as in SFU. Another case would be interaction with SIP devices (mostly IP telephony) where only the MCU is supported. Some hybrids may change from SFU to MCU on the fly when the number of participants reaches a certain threshold. There is the XDN (Experience Delivery Network) architecture from red5pro, which uses cloud technologies to solve WebRTC scaling problems. It has clusters that consist of different types of nodes. There are the source, relay, and edge nodes. Within this topology, any source node receives incoming streams and exchanges data with several edge nodes to support thousands of participants. For larger cases, source nodes can pass the flow to relay nodes, which pass the flow to multiple edge nodes to scale the cluster even more.

Kurento allows both types, using SFU by default. MCU architecture can be achieved using the Compositor element. You can mix SFU and MCU to obtain a hybrid type. Kurento uses SFU by default.

Description of the main modules and functionality

What are modules: main principles of Kurento design

Kurento includes several basic modules tailored to work not only with WebRTC, but also with video recording, computer vision, and AR filters. The media server would be a good choice if you want to work directly with regular WebRTC without using additional wrappers. It can be useful if you want to integrate your video chat with a native Android and iOS app. Kurento does not include a signaling mechanism, you can choose what works best for you depending on your project requirements, be it WebSockets or something else. Since Kurento is an open-source project, this significantly reduces the cost of using a media server.

Unlike the off-the-shelf paid media servers, Kurento is a rather low-level solution that allows you to customize the interaction of modules in the pipeline. Also, unlike, for example, Jitsi, which is a boxed solution and has a ready-made interface, with Kurento it’s much easier to choose and implement the interface you want. And of course, you can write your own module extending the standard media server functionality. The basis of the Kurento architecture design is modularity, where each media element is a program block that performs a certain task and can interact with other elements. In Kurento jargon, the developers call this Media Pipeline and Media Elements. Media Elements are simply modules that connect to each other within the Pipeline. Note, however, that the topology of the Pipeline is chosen by the developers, it does not necessarily have to be a linear sequence of elements.

Let’s talk about the main modules now.

WebRTC video conference

The basis of Kurento for video conferencing is the WebRtcEndpoint. Using it, we can transfer WebRTC streams and interconnect them in a single pipeline. In a video conference, the pipelines can usually be thought of as one room that is isolated from other similar rooms. In this case, WebRtcEndpoint is the user who wants to broadcast their own video or watch someone else’s.

Recording video and audio

The media server uses RecorderEndpoint to record. We can connect the WebRTC stream to the recorder, and the recorder will record the result to the file system. After connecting and starting the video just call the Kurento API method “record” of the recorder. We can specify restrictions when recording: for example, we can record only audio or only video.

Playing third-party media

Kurento has a built-in PlayerEndpoint module that allows you to play third-party audio and video files as well as RTMP and RTSP streams. This means that you can add a stream from the IP camera and broadcast it to other participants of the video conference. You can use the player to play music for users, which is good for music platforms, online training, or even text-to-speech. Thanks to the flexibility of the pipeline, it’s possible to play media not only to all participants but also to a specific user depending on your needs. You can do this simply by creating a new PlayerEndpoint and connecting it to the desired WebRtcEndpoint using the connect method. Once you need to play a media file, just call the player’s play method.

Computer vision and filters

The media server allows you to use many filters. For example, ZBarFilter is used to recognize QR codes, FaceOverlayFilter can recognize a face in a video stream and highlight it in real-time. And with GStreamerFilter, you can create your own custom filter. Also, Kurento has several experimental modules, including filters which are installed separately. There is kms-crowddetector that can detect a crowd, and there is a kms-platedetector that can detect a vehicle’s license plate number.

No support for multi-streaming with different resolutions

There is no support for Simulcast (sending multiple copies of the same stream in different quality) and SVC (sending a stream in low quality with the ability to add additional layers to improve quality if needed).

There’s a solution. You can create several streams and request videos with different resolutions. Switch between video streams depending on packet losses. However, this solution is far from ideal as it puts a huge load on the server. So let’s be honest: there’s no solution that’d be good per se.

Codec support in Kurento

In order to stream over a network, computers use codecs. Codecs are programs that can perform data or signal encoding. They allow you to compress video or audio to an acceptable size for network streaming and playback. 

Kurento Media Server supports codecs like H.264, VP8/9 for video, and OPUS for audio. They allow you to achieve a high degree of compression of the video stream while maintaining high quality. These codecs are the current WebRTC standards. There is no support for AV1 and H.265. These are the latest standards that allow you to reduce the bitrate for the same quality as their older counterparts. The media server itself compresses the received video stream to a quality that matches the bandwidth to the client at that moment, and if necessary, transcodes the stream into the codec the client needs.

API, documentation, Typescript support

The Kurento API can be accessed via the kurento-client library and various additional libraries, e.g. kurento-utils. You can find detailed documentation on the official Kurento website with the description of every module and usage examples for Java and Javascript. you can go either here or here. The site provides information not only about the media server itself but also about related WebRTC themes, such as configuring the TURN (Traversal Using Relay NAT) server, necessary to bypass the communication limitations of users behind a symmetric NAT. The TURN protocol allows you to get the IP address and port needed to establish a WebRTC connection in the face of NAT restrictions. Symmetric NATs have additional protection against transport data tampering. The symmetric NAT table stores 2 more parameters – IP and port of the remote host. Packets from the external network are discarded because the source data doesn’t match the data recorded in the table. The client library supports types for Typescript.


Based on Kurento, there is the OpenVidu framework, which is designed for simpler development if you don’t need anything other than simple video conferencing. No need to use a server to communicate with the framework, just connect the client library. The library has many wrappers for all popular frontend frameworks as well as for Android and iOS. OpenVidu is a free solution, but there is a paid version that provides additional features for monitoring and scaling the WebRTC platform. And there are also future plans for Simulcast and SVC and automatic switching to P2P sessions for 1-1 cases.

The number of participants in a video conference

Kurento developers have their own benchmark to determine the maximum number of sessions on one machine. Although it’s originally intended for OpenVidu, it’s also suitable for Kurento. Below you can see a table for the different machines on AWS.

Video conferencing benchmarks for AWS machines

For example, a single AWS c5.large instance with two CPU cores and 4 GB of RAM can handle 4 group-type meetings with 7 participants. In each such meeting, users show and watch each other’s streams (many-to-many). 

Summary: when is Kurento suitable?

  • for a video conferencing platform
  • for integration with AR,
  • for video recording
  • for playing media files for conference participants
  • for working with pure RTP, for example, for live streaming.

Kurento is a low-level media server. If you use Kurento, development might take a bit more time than out-of-the-box solutions. That’s if your developer doesn’t specialize in Kurento.

However, Kurento allows you to implement any functionality you want. It often happens with ready-made solutions that the customer asks to create something else, but that functionality isn’t supported.

If you’re interested in Kurento media server installation for your project, request a free quote, and we’ll get back to you ASAP!


How to Make a Custom Call Notification on Android? With Code Examples

How to create a custom Android call notification

You will learn how to make incoming call notifications on Android from basic to advanced layouts from this article. Customize the notification screen with our examples.

Last time, we told you what any Android app with calls should have and promised to show you how to implement it. Today we’ll deal with notifications for incoming calls: we’ll start with the simplest and most minimalistic ones, and end with full-screen notifications with an off-system design. Let’s get started! 

Channel creation (api 26+)

Since Android 8.0, each notification must have a notification channel to which it belongs. Before this version of the system, the user could either allow or disallow the app to show notifications, without being able to turn off only a certain category, which was not very convenient. With channels, on the other hand, the user can turn off annoying notifications from the app, such as ads and unnecessary reminders, while leaving only the ones he needs (new messages, calls, and so on).

If we don’t specify a channel ID, using the Deprecated builder. If we don’t create a channel with such an ID, the notification will not be displayed with the Android 8 or later versions.

We need the androidx.core library which you probably already have hooked up. We write in Kotlin, so we use the version of the library for that language:

dependencies {

All work with notifications is done through the system service NotificationManager. For backward compatibility, it is always better to use the Compat version of Android classes if you have them, so we will use NotificationManagerCompat. To get the instance:

val notificationManager = NotificationManagerCompat.from(context)

Let’s create our channel. You can set a lot of parameters for the channel, such as a general sound for notifications and a vibration pattern. We will set only the basic ones, and the full list you can find here.

val INCOMING_CALL_CHANNEL_ID = “incoming_call”

// Creating an object with channel data

val channel = NotificationChannelCompat.Builder(

    // channel ID, it must be unique within the package


    // The importance of the notification affects whether the notification makes a sound, is shown immediately, and so on. We set it to maximum, it’s a call after all.



    // the name of the channel, which will be displayed in the system notification settings of the application

    .setName(“Incoming calls”)

    // channel description, will be displayed in the same place

    .setDescription(“Incoming audio and video call alerts”)


// Creating the channel. If such a channel already exists, nothing happens, so this method can be used before sending each notification to the channel.


How to create notification channel on Android

Displaying a notification

Wonderful, now we can start creating the notification itself, let’s start with the simplest example:

val notificationBuilder = NotificationCompat.Builder( 


    // channel ID again



    // A small icon that will be displayed in the status bar


    // Notification title

    .setContentTitle(“Incoming call”)

    // Notification text, usually the caller’s name

    .setContentText(“James Smith”)

    // Large image, usually a photo / avatar of the caller

    .setLargeIcon(BitmapFactory.decodeResource(resources, R.drawable.logo))

    // For notification of an incoming call, it’s wise to make it so that it can’t be “swiped”


        So far we’ve only created a sort of “description” of the notification, but it’s not yet shown to the user. To display it, let’s turn to the manager again:

// Let’s get to building our notification

val notification =

// We ask the system to display it

notificationManager.notify(INCOMING_CALL_NOTIFICATION_ID, notification)

How to display a notification for Android

    The INCOMING_CALL_NOTIFICATION_ID is a notification identifier that can be used to find and interact with an already displayed notification.

        For example, the user wasn’t answering the call for a long time, the caller got tired of waiting and canceled the call. Then we can cancel notification:


        Or, in the case of a conferencing application, if more than one person has joined the caller, we can update our notification. To do this, just create a new notification and pass the same notification ID in the notify call — then the old notification will just be updated with the data, without animating the appearance of the new notification. To do this, we can reuse the old notificationBuilder by simply replacing the changed part in it:

notificationBuilder.setContentText(“James Smith, George Watson”)




Button actions upon clicking

A simple notification of an incoming call, after which the user has to find our application himself and accept or reject the call is not a very useful thing. Fortunately, we can add action buttons to our notification!

To do this, we add one or more actions when creating the notification. Creating them will look something like this:

val action = NotificationCompat.Action.Builder(

    // The icon that will be displayed on the button (or not, depends on the Android version)

    IconCompat.createWithResource(applicationContext, R.drawable.icon_accept_call),

    // The text on the button


    // The action itself, PendingIntent



Wait a minute, what does another PendingIntent mean? It’s a very broad topic, worthy of its own article, but simplistically, it’s a description of how to run an element of our application (such as an activity or service). In its simplest form it goes like this:

const val ACTION_ACCEPT_CALL = 101

// We create a normal intent, just like when we start a new Activity

val intent = Intent(applicationContext, {



// But we don’t run it ourselves, we pass it to PendingIntent, which will be called later when the button is pressed

val acceptCallIntent = PendingIntent.getActivity(applicationContext, REQUEST_CODE_ACCEPT_CALL, intent, PendingIntent.FLAG_UPDATE_CURRENT)

Accordingly, we need to handle this action in activity itself

To do this, in `onCreate()` (and in `onNewIntent()` if you use the flag `FLAG_ACTIVITY_SINGLE_TOP` for your activity), take `action` from `intent` and take the action:

override fun onNewIntent(intent: Intent?) {


    if (intent?.action == ACTION_ACCEPT_CALL) 



Now that we have everything ready for our action, we can add it to our notification via `Builder`


How to add notification buttons on Android

In addition to the buttons, we can assign an action by clicking on the notification itself, outside of the buttons. Going to the incoming call screen seems like the best solution — to do this, we repeat all the steps of creating an action, but use a different action id instead of `ACTION_ACCEPT_CALL`, and in `MainActivity.onCreate()` handle that `action` with navigation

override fun onNewIntent(intent: Intent?) {


    if (intent?.action == ACTION_SHOW_INCOMING_CALL_SCREEN)



You can also use `service` instead of `activity` to handle events.

Notifications with their own design

Notifications themselves are part of the system interface, so they will be displayed in the same system style. However, if you want to stand out, or if the standard arrangement of buttons and other notification elements don’t suit you, you can give the notifications your own unique style.

DISCLAIMER: Due to the huge variety of Android devices with different screen sizes and aspect ratios, combined with the limited positioning of elements in notifications (relative to regular application screens), Custom Content Notification is much more difficult to support

The notification will still be rendered by the system, that is, outside of our application process, so we need to use RemoteViews instead of the regular View. Note that this mechanism does not support all the familiar elements, in particular, the `ConstraintLayout` is not available.

A simple example is a custom notification with one button for accepting a call:

<!– notification_custom.xml –>













        android:textColor=”@color/fora_white” />


The layout is ready, now we need to create an instance RemoteViews and pass it to the notification constructor

val remoteView = RemoteViews(packageName, R.layout.notification_custom)

// Set the PendingIntent that will “shoot” when the button is clicked. A normal onClickListener won’t work here – again, the notification will live outside our process

remoteView.setOnClickPendingIntent(, pendingIntent)

// Add to our long-suffering builder


How to create a custom notification on Android

Our example is as simplistic as possible and, of course, a bit jarring. Usually, a customized notification is done in a style similar to the system notification, but in a branded color scheme, like the notifications in Skype, for example.

In addition to .setCustomContentView, which is a normal notification, we can separately specify mark-up for expanded state .setCustomBigContentView and for the head-up state .setCustomHeadsUpContentView

Full-screen notifications

Now our custom notification layouts match the design inside the app, but they’re still small notifications, with small buttons. And what happens when you get a normal incoming call? Our eyes are presented with a beautiful screen that takes up all the available space. Fortunately, this functionality is available to us! And we’re not afraid of any limitations associated with RemoteViews, as we can show the full `activity`.

First of all, we have to add a permission to `AndroidManifest.xml

<uses-permission android:name=”android.permission.USE_FULL_SCREEN_INTENT” />

After creating an `activity` with the desired design and functionality, we initialize the PendingIntent and add it to the notification:

val intent = Intent(this,

val pendingIntent = PendingIntent.getActivity(applicationContext, 0, intent, PendingIntent.FLAG_UPDATE_CURRENT)

// At the same time we set highPriority to true, so what is highPriority if not an incoming call?

notificationBuilder.setFullScreenIntent(pendingIntent, highPriority = true)

Yes, and that’s it! Despite the fact that this functionality is so easy to add, for some reason not all call-related applications use it. However, giants like Whatsapp and Telegram have implemented notifications of incoming calls in this way!

How to create a full screen notification on Android

Bottom line

The incoming call notification on Android is a very important part of the application. There are a lot of requirements: it should be prompt, eye-catching, but not annoying. Today we learned about the tools available to achieve all these goals. Let your notifications be always beautiful!


Video conference and text chat software development

video conference

5 Russian government agencies and both major telecom operators are clients of We developed a new version of their video conference and chat for businesses. Agencies meet there and telecom operators resell it to businesses under their brands. Read a reference from the client on Clutch – search for “Intermind”.






Features for video, audio, and text communication software

🎦 WebRTC videoconference

We develop for any number of participants:

  • One-on-one video chats
  • Video conferences with an unlimited number of participants

50 live videos on one screen at the same time was the maximum we’ve done. For example, Zoom has 100 live video participants, though it shows 25 live videos on one screen. To see the others, you switch between screens.

Some other functions: custom backgrounds, enlarging videos of particular participants, picking a camera and microphone from the list, muting a camera and microphone, and a video preview of how you look.

🎬 Conference recording

Record the whole screen of the conference. Set the time to store recordings on the server. For example, on we keep videos for 30 days on a free plan forever on the most advanced one.

Do not interrupt the recording if the recorder dropped off. In Zoom, if the recorder leaves, the recording stops. In it continues.

💻 Screen sharing and sharing multiple screens simultaneously

Show your screen instead of a video. Choose to show everything or just 1 application – to not show private data accidentally.

Make all video participants share screens at the same time. It helps to compare something. Users don’t have to stop one sharing and start another one. See it in action at

☎️ Join a conference from a landline phone

For those in the countryside without an Internet connection. Dial a phone number on a wired telephone or your mobile and enter the conference with audio, without a video. SIP technology with Asterisk and FreeSWITCH servers powers this function.

💬 Text chat

Send text messages and emoticons. React with emojis. Send pictures and documents. Go to a private chat with one participant. See a list of participants.

✒️ Document editing and signing

Share a document on the conference screen. Scroll through it together, make changes. Sign: upload your signature image or draw it manually. Convenient for remote contract signing in the pandemic.

📋 Polls

Create polls with open and closed questions. View statistics. Make the collective decision-making process faster!

🎙 Webinars

In the broadcast mode, display a presentation full-screen to the audience, plus the presenter’s video. Add guest speakers’ videos. Record the whole session to share with participants afterward.

⌚️ Everlasting rooms with custom links

Create a room and set a custom link to it like It’s convenient for regular meetings. Ask participants to add the link to bookmarks and enter at the agreed time each time.

👥 User management

Assign administrators and delegate them the creation of rooms, addition, and deletion of users.

🔐 Security

  • One-time codes instead of passwords
  • Host approves guests before they enter the conference
  • See a picture of the guest before approving him
  • Encryption: we enable AES-256 encryption in WebRTC

🎨 Custom branding

Change color schemes, use your logo, change backgrounds to corporate images.

🗣 Speech-to-text and translation

User speech is recognized and shown on the screen. It can be in another language for translation.

📺 Watch videos together online

Watch a movie or a sports game together with friends. Show an employee onboarding video to the new staff members. Chat by video, voice, and text.

📝 Subscription plans

Free plans with basic functionality, advanced ones for pro and business users.

Industries Fora Soft developed real-time communication tools for

  • 👨‍💼 Businesses – corporate communication tools
  • 🧑‍⚕️ Telemedicine – HIPAA-compliant, with EMR, visit scheduling, and payments
  • 👨‍🎓 E-learning – with whiteboards, LMS, teacher reviews, lesson booking, and payments
  • 👩‍🎤 Entertainment: online cinemas, messengers
  • 🏋️ Fitness and training
  • 🛍 Ecommerce and marketplaces – text chats, demonstrations of goods and services by live video calls

Devices Fora Soft develops for

  • Web browsers
    Chrome, Firefox, Safari, Opera, Edge – applications that require no download
  • Phones and tablets on iOS and Android
    Native applications that you download from AppStore and Google Play
  • Desktop and laptop computers
    Applications that you download and install
  • Smart TVs
    Javascript applications for Samsung and LG, Kotlin apps for Android-based STBs, Swift apps for Apple TV
  • Virtual reality (VR) headsets
    Meetings in virtual rooms

🛠 What technologies to develop a custom video conference on

Basic technology to transmit video

Different technologies suit best for different tasks:

  • for video chats and conferences – WebRTC
  • for broadcasting to a big audience – HLS
  • for streaming to third-party products like YouTube and Facebook – RTMP
  • for calling to phone numbers – SIP
  • for connecting IP cameras – RTSP and RTP

Freelancer or an agency that does not specialize in video software may pick the technology they are best familiar with. It might be not the best for your tasks. In the worst case, you’ll have to throw the work away and redo it. 

We know all the video technologies well. So we choose what’s best for your goal. If you need several of these features in one project – a mix of these technologies should be used. 

WebRTC is the main technology almost always used for video conferences though. This is the technology for media streaming in real-time that works across all browsers and mobile devices people now use. Google, Apple, and Microsoft support and develop it.

WebRTC supports VP8, VP9 and H264 Constrained Baseline profile for video and OPUS, G.711 (PCMA and PCMU) for audio. It allows sending video up to 8,192 x 4,320 pixels – more than 4K. So the limitations to video stream quality on WebRTC are the internet speed and device power of the end-user. 

WebRTC video quality is better than in SIP-based video chats, as a study of an Indonesian university shows. See Figure 6 on page 9: Video test results and read the reasoning below it.

Is a media server needed for video conferencing software development?

For video chats with 2-6 participants, we develop p2p solutions. You don’t pay for the heavy video traffic on your servers.

For video conferences with 7 and more people, we use media servers and bridges – Kurento is the 1st choice. 

For “quick and dirty” prototypes we can integrate third-party solutions – ready implementations of video chats with media servers that allow slight customization. 

  • p2p video chats

P2p means video and audio go directly from sender to receivers. Streams do not have to go to a powerful server first. Computers, smartphones, and tablets people use nowadays are powerful enough to handle 2-6 streams without delays.

Many businesses do not need more people in a video conference. Telemedicine usually means just 2 participants: a doctor and a patient. The development of a video chat with a media server is a mistake here. Businesses would have to pay for the traffic going through the server not receiving any benefit.

  • Video conferences with a media server

Users cannot handle sending more than 5 outgoing video streams without lags now. People’s computers, smartphones, and tablets are not powerful enough. While sending their own video, they accept incoming streams. So for more than 6 people in video chat – each sends just 1 outgoing stream to a media server. The media server is powerful enough to send this stream to each participant.

Kurento is our first choice of media servers now for 3 reasons:

  • It is reliable.

    It was one of the first media servers to appear. So it gained the biggest community of developers. The more developers use technology the faster they solve issues, the quicker you find the answers to questions. This makes development quicker and easier, so you pay less for it.

    Twilio bought Kurento technology for $8.5 million. Now Twilio provides the most reliable paid third-party video chat solution, based on our experience.

    In 2021, other media servers have smaller developers’ and contributors’ communities or are backed by not-so-big companies, based on our experience and impression. They either are not as reliable as Kurento or do not allow developing that many functions.
  • It allows adding the widest number of custom features.

    From screen sharing to face recognition and more – we have not faced any feature that our client would want, not possible to develop with Kurento. To give developers this possibility, the Kurento contributors had to develop each one separately and polish it to a well-working solution. Other media servers did not have that much time and resources to offer the same.
  • It is free.

    Kurento is open-source. It means you may use it in your products legally for free. You don’t have to pay royalties to the technology owner.

We work with other media servers and bridges – when not that many functions are needed, or it is an existing product already using another media server:

We compare media servers and bridges regularly as all of them develop. Knowing your needs, we recommend the optimal choice.

  • Integration of third-party solutions

Third-party solutions are paid: you pay for minutes of usage. The development of a custom video chat is cheaper in the long run.

Their features are also limited to what their developers developed.

They are quicker to integrate and get a working prototype though. If you need to impress investors – we can integrate them. You get your app quicker and cheaper compared to the custom development.

However, to replace it with a custom video chat later – you’ll have to throw away the existing implementation and develop a custom one. So, you’ll pay twice for the video component.

We use these 3 -they are the most reliable ones based on our experience:

Write to us: we’ll help to pick optimal technologies for your video conference.

💵 How much the development of a video conference costs

You’re here – means ready solutions sold as is to integrate into your existing software probably do not suit you and you need a custom one. The cost of a custom one depends on features and their complexity. So we can’t say the price before knowing these.

Take even the log in function as an example. A simple one is just through email and password. A complex one may have a login through Facebook, Google, and others. Each way requires extra effort to implement. So the cost may differ several times. And login is the simplest function for a few work hours. Imagine how much the added complexity will influence the cost of more complex functions. And you’d probably have quite a lot of functions.

Though we can give some indications.

✅ The simplest video chat component takes us 2-4 weeks and costs USD 8000. It is not a fully functioning system with login, subscriptions, booking, etc. – just the video chat with a text chat and screen sharing. You’d integrate it into your website or app and it would receive user info from there. 

✅ The simplest fully functional video chat system takes us about 4-5 months and around USD 56 000. It is built from the ground up for one platform – either web or iOS or Android for example. Users register, pick a plan, and use the system.

✅ A big video conferencing solution development is an ongoing work. The 1st release takes about 7 months and USD 280 000.
Reach us, let’s discuss your project. After the 1st call, you get an approximate estimation.


Minimizing latency to less than 1 sec for mass streams

Is it possible to achieve less than a second latency in a video broadcast? What if the stream goes to a thousand people? Yes. How? Let’s answer using the project WorldCast Live as an example. We did it using WebRTC. WCL streams HD concerts to an audience of hundreds and thousands of people.

Why would I reduce the broadcast latency?

Latency less than a second in a videoconference is normal, otherwise speaking is nearly impossible. For one-sided streaming, the latency of 2 or even 20 seconds is fine. For example, TV delay is about 6 sec. However, there are cases where you’d try to reduce it to 1 second.

  • Sport events

It’s highly unlikely that the user will be happy when their neighbors shout, GOAL!, and he still sees the ball somewhere in the middle of the field. What if the user is also live betting?

  • Interviews

Thanks to the pandemic, not only talks with friends have moved online, but also interviews with celebrities. Take an interviewer’s latency, add to the interviewee’s latency, and add the time spent while it all goes to the user. The more you get, the worse the experience is for all sides.

  • Concerts

The WCL player is embedded in different sites. Concerts broadcast online to all of them. If you and your friend use different sites to watch the same concert, it will negatively affect your viewing experience. Thus, the concert organizers are trying to minimize the latency. Although the user wouldn’t really care whether he hears the guitar riff now or in 2 seconds, there is a general trend on minimizing the latency. The faster the better!

And also…

The examples above combine. In WCL viewers and performers talk via a video chat. Therefore, the latency has to be like that of a video chat, as we need the minimum difference in time between the question and the answer.

How to reach low latency in live streaming?

Use WebRTC

The standard WebRTC package offers average latency of 500 ms (half a second). We could’ve finished the article here: creating a video chat where people can connect with each other isn’t difficult. What’s difficult is making it all work stable for thousands of people and customizing streams to improve their quality. 

Base WebRTC isn’t about HD streaming. Video and audio quality is enough for speaking but isn’t enough for streaming music. To decrease the latency, WebRTC can reduce the quality and skip parts of the content. To avoid it, you have to go under the hood of WebRTC, which we did.

Set up WebRTC

To ensure that the low latency doesn’t go against quality and the final user’s experience, you need to go for extra development. Here’s what we did for WCL so it can broadcasts concerts to thousands of people:

  • Enabled multi-channel audio

The standard audio in WebRTC is mono, it’s 1 channel. The stereo is 2 channels. There are 5 channels on WCL.

  • Upped bitrate

WebRTC settings limit the video transmission speed to 500 kb/s. That’s not much for action on the screen: if light colors change quickly with this limit, the quality will be lower because of trying to go inside that channel, pixels might appear. Not a great watching experience. Therefore we’ve increased the bitrate to 1,5 Gb to transmit HD video.

  • Increased discretization frequency

By this, we’ve improved the quality of audio and video, low and high frequencies. They are not lost anymore. We can’t disclose what exactly we did, but if you’re interested in that, let us know!

Scale Kurento

Kurento Media Server is an open-source WebRTC server.  Set up Master Kurento to stream video. 500 hundred people will connect directly to Master Kurento, from where they’ll get the stream. If there are more viewers, Edge Kurento is in play, to which other viewers connect. The more viewers there are on the stream, the more Edge Kurento you need to use. All together it creates a tree-like scheme.

When do you leave the latency of more than a second?

When the budget is limited. WebRTC is more expensive than HLS if you need scaling. The WebRTC server on WCL costs $0,17/h which is $122,4 monthly if we take 30 days.

HLS, however, costs $0,023/h and can be turned on and off, unlike WebRTC. If we take 3 hour-long concerts a week, then we’ll spend a bit less than $0,28 for 12 concerts. Note that there are many server providers, the prices are always different, but you can see what it looks like using our project as an example.

The first stable version of subsecond latency on WCL took us 3,5 weeks. If there’s no necessity in very low latency, why spend?


Latency less than a second in a broadcast is necessary when there’s communication between participants or a stream where things change all the time. WebRTC makes it possible for even a thousand people if you work with the technology.

If your app is about something from here, make sure to take a look at WebRTC. Contact us, too, we’ll help! Contact us via our form or DM on Instagram, which we created not so long ago 🙂


How to estimate the server cost for a video platform

Let’s take a look at how to estimate a video streaming platform after the launch.

The combined cost is based on the cost of traffic, data storing, and the server itself.

  • Traffic – the information that goes through our servers. In general, it’s audio and video streams. As a rule of thumb, services require way more money for the outgoing traffic than for the incoming traffic. So, we will only concentrate on the outgoing traffic here
  • Data storing – storing video conference recordings. You can store them on a rented server if there is not too much data. Otherwise, you will have to pay specifically for storing data
  • The server itself is the computer with the project. Usually, its rent costs a specific monthly amount. When one server isn’t enough because of too many users, people would use many servers instead. More on that later.

Traffic costs

Counting the traffic costs is the first task as it is one of the main expenditures on video streaming services. To complete it, one must calculate the amount of traffic that will go through our server within a chosen period of time. We will use 1 hour.

There is a convenient website to help us calculate the size of video files. How to use it? Insert the video resolution, duration, and FPS. You can compare terms such as 480p or 4k against “real” ones here.

You will see a sheet that might look difficult to understand, but it’s not true. Just search for Video File Size. This sheet has possible sizes that our video will have after using different compressing algorithms. For simplicity’s sake, we’ll use the compression YouTube uses. It is easy to realize and WebRTC video compression is close to the one we went with. The price will increase drastically if we decide to go with a lesser compression. Using more powerful compression may result in losing quality. It all depends on what our goal is. We will still be using YouTube-style compression as an example, and the amount of traffic we get in result will correspond to a one-way video stream.

For 1-on-1 chats, there will be 2 streams. If there are 1 streamer and 20 viewers – 20 streams. 2 streamers + 50 viewers – 100 streams (each of the 2 streams is sent to 50 viewers) or 102, if streamers receive streams from one another. Now, since we have the amount of traffic for a period of time, we need to multiply it by the price of our provider set. By the provider, we mean AWS, DigitalOcean, or our telecom provider if the servers are within our reach.

We only look at the outgoing traffic. That’s because the majority of providers offer either free unlimited incoming traffic or set such a low price that it’s not even worth checking here.

So, here is our traffic price. However, it’s just one third, so let’s move on to…

Price for storing data

If you don’t keep your stream recordings, feel free to skip this section, as the price that doesn’t concern storing video content is very low.

So, you need to keep video recordings on your server and you don’t know how much it will cost? Let us help you. Get the video size per one stream from the previous section and multiply it by the number of streamers. Multiply the result by the price for storing data on your server (calculate the price of HDD/SDD if that’s your private server).

So, all we have to do now is counting the price of keeping and supporting the server.

The server itself

Media server price

If the project is going to be somewhat loaded, you will have to pay for the hardware that sends out your streams. The amount of servers depends on the load, and they all have to be paid hourly. If you know the cost, perfect. If not, try adding something like $0,3 per hour (this is an average price for some Amazon EC2 category c5 servers)

Price for a reserved server

It’s more likely that your project will have at least one more central server with the main website on it. One would usually pay a fixed monthly amount for it. It’d be amazing if you knew how much your server cost. If you don’t, feel free to use the market average as of now – $30-$70 monthly. If it’s a simple landing without added functionality, $15 will probably do. However, if that’s a good website with an account room, video conference rooms list, text chat, notifications, and a stable online amount of more than 100 people daily, the server will cost more (from $50).


Take a 9-max meeting room. We’ll go with 640×480, compression algorithm equivalent to YouTube’s, 30 FPS (in the calculator this goes as 29.97), Amazon AWS c5 xlarge servers. Don’t forget to swap seconds for minutes!

Let’s calculate the traffic costs. Use the calculator and fill in the resolution and FPS. Check the result – 1 hour of a one-way video will take 639 Mb. Every user receives 8 streams from other participants. Therefore, we have 9*8=72 streams like that.

We get 46 GB per hour. Since we use Amazon, the traffic cost is $0,09-$0,01 per GB, depending on how much traffic was used up in the current month. In order to get to $0,01, we will have to use an immensely huge amount of traffic (40 TB), so we will use $0,08 for calculation. 46 GB by $0,08 is $3,68/h.

Let’s say we want to store the video; the size of 1 hour is 0,639 GB. We want to keep the video streams of all 9 people. The server will keep one video where other clips will be combined in one grid. Therefore, it will still take 0,639 GB. We use General Purpose SSD by Amazon S2. It costs $0,1 per GB, monthly! Therefore, we have to pay ~$0,06 for each hour of our videos each month they stay on our servers.

Amazon AWS c5 xlarge server costs $0,17 per hour. Add it to the traffic cost. Therefore, we get $3,85 as a cost for each hour of active meetings, and on top of that, $0,06 monthly for storing 1 hour of meetings.

If we want to imagine how much it will take us, we are going to need at least approximate numbers of streams daily and monthly.

Let’s say we have 4 hours of meetings like that in a day. Assuming there are 30 days in a month. we multiply 4 hours by $3,85 to get the service cost per day – $15,4, which we then multiply by 30 to find out the monthly cost with 4 hours of meetings daily =$462.

Let’s assume that we allow storing up to 100 hours of video. Add $6 monthly.

Amazon c5 large is our main server. It costs $45,99 when reserved.

To sum up all our approximate estimations for a month, we’ll pay $513,99.

Webinar with 2 streamers and 50 viewers

We keep the similar parameters: 640×480, compression algorithm equivalent to YouTube, 30 FPS, Amazon AWS c5 2xlarge servers.

The stream will still be 639 MB per hour. Since each streamer streams for 50 viewers, we have 100 streams – 63,9 GB per hour. The average cost of traffic stays at $0,08 per GB. 63,9 GB * $0,08 = $5,11 per hour.

We are not storing this video, but if we were, it would have been just 1 stream (we have 2 streamers, so we store them in the grid as 1 video), so it’s still 0,639 GB per hour. With the price of $0,01 for an hour, we’d get $0,05112 per month, for each hour of stored video.

The final result is $5,11 for an hour of a stream or $0,05112 per month for each hour of stored video, should we ever need that.

We’ll reserve Amazon c5 large as the main server. Its cost is still $45,99. So with the server, the cost is $1599,59.

1-on-1 video chats via p2p

We can tell from experience that one c5 large server can maintain more than 1000 simultaneous 1-on-1 video chats via p2p. We’ll be using that server. In this case, the traffic cost is minimal because we just help two users connect with each other. We don’t send any video streams through our servers. Nor does our traffic go through the server and we don’t need to calculate it. The price is negligible. We don’t store recordings, too.

The servers cost $0,17 per hour. For the insurance measures let’s imagine that we get a new server up each time we get 1000 people.

The approximate price for a month if we have up to 1000 chats every day, and they last for 8 hours. 0,17*8=$1,36 a day. Multiply that by 30 to get $40,8 a month.

The server we reserve is Amazon c5 large. With added $45,99, we have to pay $86,79 monthly.


What is WebRTC? Explanation in plain language

The majority of WebRTC-related material is about the application level of code writing and doesn’t help understand the technology. Let’s dive deeper into the topic and find out how the connection establishes, why we need the TURN and STUN servers, and what a session descriptor and candidates are.


What is WebRTC for?

WebRTC is a browser-oriented technology that allows us to connect to clients to transmit video data. Internal browser support (external technologies, such as Adobe Flash, aren’t needed) and an ability to connect clients without using any additional servers (p2p connection) are the main peculiarities of WebRTC.

Establishing a p2p connection is complicated as computers don’t always possess public IPs (their internet addresses). Due to a low amount of IPv4 addresses and for the security’s sake, NAT was invented. It allows creating private networks, for instance, for home use. Many home routers support NAT, thus all devices that are connected to the router have internet access, although service providers usually allow one IP address. Public IPs are unique, whereas private ones aren’t, hence p2p connection is difficult.

To better understand the concept, let’s take a look at three scenarios:

  1. Both nodes are within the same network 
  1. Both nodes are within different networks (private and public) 
  1. Both nodes are within different private networks with the same IPs

The first letter in the images above represents a node type: r for a router, p for a peer.

  1. Image one shows a nice situation. Nodes within their networks identify with network IP addresses, and they can directly connect with each other. 
  2. Image two shows two different networks with similarly sequenced nodes. We introduce routers here, and they have two network interfaces: inside and outside their system. Hence, they have two IPs. Usually, nodes have only one interface, and they use it to interact within their networks, and if they transmit data to something outside their system, they only do it with the help of NAT inside of a router. That’s why these nodes appear as a router IP address – it’s their external IP. Therefore, the p1 node has an external IP ( and an internal one (, with the first address also being external for all other nodes within the network. The p2 node experiences similar circumstances, so their connection is impossible as long as only their internal addresses are used. It’s possible to go with external IPs, but it poses a challenge since all nodes within the same private network are under the same external address. NAT solves this problem.
  3. What happens if we decide to connect nodes via their internal addresses? The data won’t leave the network. To magnify the effect, imagine a situation from the third image, where both nodes have the same internal addresses. If they use those addresses to communicate with one another, they both will be communicating with themselves.

Here is where WebRTC steps in. In order to solve these problems, WebRTC uses the ICE protocol which requires additional STUN and TURN servers. 

The two phases of WebRTC

In order to connect two nodes with the WebRTC protocol (or just RTC if there are two iPhones), it’s necessary to complete some preliminary steps to establish a connection. That’s the first phase. The 2nd phase is video data transmission.

Although WebRTC uses lots of means of communication (TCP and UDP) and can flexibly switch between them, this technology does not possess a protocol for transmitting connection data, which is not surprising as connecting two p2p nodes isn’t a simple task. Having said that, we need an additional, not related to WebRTC, data transmission way. It can be an HTML protocol, socket transmission, or an SMTP protocol. This way of sending initial data is a signaling mechanism. Not too much information is transmitted. The data is transmitted as text and is split into two categories: SDP and Ice Candidate (you can also read about them here) SDP is used to establish a logical connection, Ice Candidate is for a physical connection. It’s important to remember that WebRTC gives you the information that needs to be passed on to the next node. As soon as we transmit the necessary information, the nodes will be able to connect, and our help won’t be needed anymore. Therefore, a signaling mechanism which we need to create separately, will be used only upon connection and not while we transmit video data.

So, let’s take a look at the first phase. It consists of several steps. First, let’s look at it as for the connection initiating node, and then as for the connection receiving node.

  • Initiator (caller):
  1. Receiving a local media stream and establishing its transmission (getUserMediaStream)
  2. An offer to begin video data transmission (createOffer)
  3. Receiving an own SDP object and sending it via the signaling mechanism (SDP)
  4. Receiving own Ice candidate objects and sending them via the signaling mechanism (Ice candidate)
  5. Receiving a remote media stream and showing it on the screen (onAddStream)
  • Receiver (callee)
  1. Getting a local media stream and establishing its transmission (getUserMediaStream)
  2. An offer to begin video data transmission and answer creation (createAnswer)
  3. Receiving an own SDP object and sending it via the signaling mechanism (SDP)
  4. Receiving own Ice candidate objects and sending them via the signaling mechanism (Ice candidate)
  5. Receiving a remote media stream and showing it on the screen (on AddStream)

Only the 2nd step is different.

However complicated these steps might seem, as a matter of fact, there are just three of them: sending a local media stream (step 1), establishing the connection parameters (steps 2-4), receiving a remote media stream (step 5). The 2nd step is the most difficult one as it consists of two parts – we need to establish the logical and physical connection. The latter shows the way for the packet to follow to get from one node to the other, and the former points at video and audio parameters – what quality and codecs to use.

Connect the createOffer and the createAnswer steps to the steps with the transmission of SDP and Ice Candidate objects.

Now we are going to take a look at some entities, such as MediaStream, SDP, and Ice Candidate.

Main entities


MediaStream is a basic entity, it consists of video and audio data streams. There are two types of media streams, local and remote. Local streams receive data from the input devices (camera, mic), remote streams receive data from the network.

Therefore, every node has a local and a remote stream. In WebRTC, for these streams, there is a MediaStream interface, as well as a LocalMediaStream sub-interface which is out there specifically for a local stream. In JavaScript, you can only face the former one, but if you use libjingle, you can also encounter the latter.

WebRTC suggests a difficult hierarchy within a stream. Every stream consists of several media tracks (MediaTrack), which can consist of several media channels (MediaChannel). There can also be several media streams.

For example, we not only want to transmit a video of ourselves but also our table with a piece of paper on it, as we are about to write something on the piece of paper. We’ll need two videos (of us and the table) and one audio (us). Obviously, we and the table should be divided into different streams, as they aren’t really dependent on each other. That’s why we’ll have two MediaStreams: one for us and one for the table. The first one will have video and audio data, and the 2nd one – video data only.

The media stream has to provide an opportunity to keep different types of data, namely video and audio. This is accounted for in the technology, therefore every data type is realized through MediaTrack. MediaTrack has a special quality called kind which determines whether it’s a video or audio before us.

So how does everything happen inside the program? We create two media streams. Then we’ll proceed to create two video tracks and one audio track. Get access to the camera and microphone. Tell every track what feature it needs to use. Add a video and audio tracks into the first media stream and the video track from the 2nd camera – into the 2nd media stream.

How to distinguish media tracks on the other end? By the feature label that every media channel has. Media tracks have the same feature.

So, if we could identify the media tracks with a mark, why do we need to use two of them instead of one in this example? You can transmit one media stream and use different tracks within it. Now we’ve reached an important feature of media tracks, they synchronize media tracks. Different media tracks aren’t synced between each other, but all tracks are played simultaneously within each media track.

Therefore, if we want our words, facial expressions and the piece of paper to be played at the same time, we need to use the same media track. If it’s not too important, it’d be better to use different media tracks, so the picture is smoother.

If a track needs to be switched off during the transmission, we can use the enabled feature of a media track.

In the end, it’d be nice to think about stereo sound. Stereo is two different sounds, and they have to be transmitted separately. MediaChannel is used for that. Media track can use different channels (for instance, 6 if we need a 5+1 sound). The channels inside the media track are also synced. When a video is played, usually one channel is used, but it’s possible to use several of them, for example, to apply advertisement.

To summarize: we use a media stream to transmit video and audio data. The data is synced inside each media stream. We could use different media channels if we don’t aim for synchronization. There are two media tracks inside each stream, for video and audio. There can be more tracks if we need to transmit different videos (interlocutor and their table). Every track can consist of different channels but usually is used for stereo sound only.

In the simplest situation, we won’t have a video chat, and there’ll only be one local media stream of two tracks, audio and video. Each track will consist of one primary channel. The video track is responsible for the camera, while the audio track – for the microphone. The media stream is a container for both of them.

Session descriptor (SDP)

Different computers have different cameras, mics, graphics cards, etc. There is a multitude of parameters to them. It all needs to be coordinated for the media data transmission between two network nodes. WebRTC does it automatically and creates a special object – SDP. Transmit SDP to another node, and you can transmit the video data. There is no connection with another node, though.

Any signaling mechanism can help here. SDP can be sent via sockets, humans (tell the SDP to another node via the phone), or.. well, post office. You get a ready SDP, and it needs to be sent out – as simple as that. When the other guy receives SDP, they need to send it to WebRTC. It is stored as a text and can be changed from the applications, but it’s rarely needed. As an example, with a desktop <-> phone connection, sometimes it’s obligatory to forcefully choose the right audio codec.

Usually, when the connection is established, it’s obligatory to mention an address, such as a URL. There is no necessity to do it here, as you yourself will send the data via the signaling mechanism. To tell WebRTC that we want to establish a p2p connection, function createOffer has to be invoked. After that’s been done, and the special callback created, a new SDP object will be created and sent to the same callback. All you need to do is transmit this object to another node (interlocutor) via the network. 

The signaling mechanism will help data, this SDP object, to arrive. This session descriptor is alien for this node, therefore it bears useful information. 

Receiving this object is a signal to start the connection. So, you have to agree with it and call the createAnswer function. It is an absolute analog to createOffer. Your callback will receive a local session descriptor and then will need to be transmitted via the signaling mechanism back again.

It’s worth mentioning that calling a createAnswer function is only possible after receiving an alien SDP object. That’s because the local SDP object that will generate upon calling createAnswer has to rely on a remote SDP object. Only then will it be possible to coordinate your video settings with those of your interlocutor. Also, don’t call createAnswer and createOffer before receiving a local media stream, as they will have nothing to write to the SDP object.

Since WebRTC allows you to edit an SDP object, you will need to install the local descriptor upon receiving it. Sending the things WebRTC gave us back to it might seem strange, but that’s the protocol. A remote descriptor also needs to be installed upon receiving.

After this handshaking of some sort, the nodes will learn about each other’s wishes. For example, if node 1 supports codecs A and B, and node 2 supports codecs B and C, they both will choose codec B. That’s because these nodes know local and alien descriptors. The connection logic has been established and it’s possible to send media streams now. There is another problem, though: the nodes are still connected with just a signaling mechanism.

Ice candidates

Upon establishing a connection, the address of the node that you need to connect with isn’t mentioned. First, logical connection establishes, then physical, although it used to be the other way around. It won’t be so strange, however, if we keep in mind that we use an external signaling mechanism.

So, the logical connection has been established but there’s no path that the nodes can use to transmit data yet. Not everything is simple here, but we can still start with the simple things. Imagine that the nodes are within the same private network. As we know, they can easily connect with each other via their internal IPs (or other addresses if TCP/IP is not in use).

WebRTC tells us the Ice candidate objects through some callbacks. They too arrive in the form of text, and they too need to be sent through a signaling mechanism, just like the session descriptors. If the session descriptor contained information about our settings on the camera and phone level, candidates do that with our placement inside a network. Send them to another node, and it will be able to logically connect with us. As it already has a session descriptor, the data will flow in. If it doesn’t forget to send us its candidate object (information on where it’s placed inside the network), we’ll be able to connect with it.

There is another difference from a classical client-server interaction. Communication with an HTTP server goes as request-answer. The client sends data to the server, the server processes it and sends it to the address mentioned in the request packet. It’s obligatory to know two addresses in WebRTC and connect them from both sides.

The difference from session descriptors is that only remote candidates have to be installed. Editing is prohibited here and won’t be of use. In different WebRTC realizations, candidates must be installed only after the installation of session descriptors.

So, why can there be one session descriptor but lots of candidates? Because placement within a network can be determined not only by an own internal IP address but also by an external router address (one or more) and by TURN server addresses.

So, we have two candidates within one network (picture below). How to identify them? With the help of IP addresses only. Of course, different transport can be used (TCP and UDP), as well as different ports. This is the information that’s contained inside the candidate object – IP, TRANSPORT, PORT, etc. For instance, let’s take port 531 and UDP transport.

So, when we’re inside the p1 node, WebRTC will send us this as a candidate object: [, 531, udp]. It’s not an exact thing, just a scheme. If we’re inside the p2 node, the candidate will change to [, 531, udp]. P1 will receive p2’s IP and PORT through a signaling mechanism and will be able to connect to p2 directly. In fact, p1 will send data to, hoping that it will reach p2. Whether that address is owned by p2 or an intermediary, not important. What is important is that the data will be sent to this address and will be able to reach p2.

While the nodes are inside the same network, everything is a piece of cake, as every node only has one candidate object (their own, which is their placement in the network). But the number of candidates will grow by a lot if the nodes are in different networks.

Let’s take a look at a more complicated case. One node is behind a router (NAT), and the 2nd node is in the same network as that router (for example, on the internet).

This case has its own solution. A home router usually has a NAT table. This mechanism is created for the nodes inside a private router network to communicate with, for example, websites.

Let’s assume that a web-server is connected to the internet directly, meaning it has a public IP. Let it be the p2 node. Then, the p1 node (web client) sends a request to the address. First, the data arrives at the r1 router or, to be precise, to its internal interface After that, the router memorizes the source address (p1) and puts it in the NAT table. Then, the router changes the source address to its own (p1 -> r1). Then, using its external interface, the router sends the data to the p2 web server. The web server processes the data generates an answer which it sends back to the router. When the router receives the data, it checks the NAT table and sends the data over to the p1 node. The router here is an intermediary.

Well, what if several nodes from the internal network send a request to the external network? How does a router realize where to send the answer? This problem is solved with the help of ports. When the router substitutes the node address with its’ own, it also substitutes the port. If two nodes request the internet, then the router substitutes their source ports to different ones. Then, when the packet from the web server returns to the router, the router will understand the recipient of the packet by the port. The example is down below.

Going back to the WebRTC and the part where it uses an ICE protocol (hence Ice candidates). The p2 node has one candidate (its placement inside the network,, and the p1 node that is with the router with NAT, has 2 candidates: local ( and a router candidate ( The first one isn’t of much use here, however, it is being generated as WebRTC knows nothing about a remote node – it can be within the same network or not. The second candidate is useful and as we know, the port will have an important role to get through NAT.

The entry in the NAT table is generated only when the data leaves the internal network. That’s why the p1 node has to send its data first, and only then can the data from p2 reach p1.

Actually, both nodes will be behind NAT. To create an entry in every router’s NAT table, nodes have to send something to a remote node, but this time none will be able to reach the other. That’s because nodes don’t know their external IP addresses, and sending data to the internal addresses is pointless.

However, if external addresses are known, the connection will be easily established. If the first node sends the data to the second node router, the router will ignore the data as its NAT table is empty at that moment. However, the first node router has got an entry in the NAT table. Now, as soon as the 2nd node sends the data to the first node router, the router will successfully send to the 1st node. Now, the NAT table of the 2nd router has the needed data.

The problem is, to find out an external IP, we need a node that is inside a public network. In order to deal with this problem, additional servers are used, that are connected to the internet directly. They also help create those entries in the NAT table.

STUN and TURN servers

Available STUN and TURN servers must be mentioned upon a WebRTC initialization, and we’ll be calling them ICE servers from now on. If the servers aren’t mentioned, only nodes from the same network will be able to connect (those that are connected to the network without NAT). It’s important to mention that 3g networks require you to use TURN servers to be operational.

The STUN server is a server on the internet that sends a return address (source address of the node) back. The node behind the router communicates with a STUN server to bypass NAT. A packet that arrived at the STUN server contains a source address. It is a router address, in other words, an external address of our node. This is the address that a STUN server returns. Therefore, a node receives its external IP and port that makes him available in the network. Then, WebRTC creates an additional candidate with this address (external router address and port). Now the NAT table has an entry that allows the packets that are sent to the router via a correct port, to our node.

A STUN server example: how it works

The STUN server will be s1. Router and node stand as r1 and p1 respectively. We will also need to look after a NAT table, let’s make it r1_nat. In that table there is usually a lot of entries from different subnetwork nodes – we won’t mention them.

Let’s start with an empty r1-nat:

Internal IPInternal PORTExternal IPExternal PORT

There are 4 columns in the table. It gives each column from the first two (IP, PORT), their couple from the last two (IP, PORT).

P1 sends a packet to s1. We see four interesting fields in the table down below, they’re in the title of a transport packet (TCP or UDP) – IP and PORT of the source and receiver. Let us imagine that these are the addresses.


P1 sends this packet to r1. The router will need to substitute the address of a source Src IP, as the address that’s mentioned in the packet won’t work for an external network. Furthermore, addresses from that range are reserved, and there is no address on the internet that has that address. The router substitutes the packet and creates a new entry in r1_nat. That’s why it needs to come up with a port number. Since different nodes within a subnetwork can call out to an external network, the NAT table has to contain additional information, so that the router can determine, what node is the recipient for the return packet from the server. Let’s imagine that the router created a port 888.

The changed packet heading:

Src IPSrc PORTDest IPDest PORT – router’s external address.


Internal IPInternal PORTExternal IPExternal PORT

IP address and a subnetwork port are the same as in the initial packet. Actually, sending it back, we need to have a way to completely restore them. IP for the external network is a router address, and the port will change to one created by the router.

An actual port, to which node p1 accepts connection is, indeed, 35777, but the server sends data to a dummy port 888. It will be later changed to the real one, 35777.

So, the router has substituted an address and a port of the source in the packet heading and added an entry to the NAT  Now the packet is sent via the network to the server – to the s1 node. S1 has a packet like this upon entrance:


So, a STUN server knows that it received a packet from The server sends this address back now. It’s worth stopping here for a bit and look once again at this.

The tables above are a piece from a packet heading, not from its content. We haven’t discussed the content since it’s not so important – it’s described in the STUN protocol. Now, however, we also will be looking at the content. It will be simple and will contain the router address –, despite us taking it from the packet heading. It’s not done often as protocols don’t usually care about node addresses. The only important thing is that the packets are delivered as intended. But here we are looking at a protocol that establishes a path between the two nodes.

Now we got the 2nd packet which goes backward:


The heading has changed because the source and the receiver swapped places which is logical as the packet’s destination is different now.


This is the content of the packet. Actually, it could contain a lot of information. But only what’s important for understanding how the STUN server works is mentioned here.

Then the packet travels throughout the network unless it ends up on the external interface of r1. The router understands that the packet isn’t meant for him. How? It can be determined by the port. Port 888 isn’t used by the router for its own purpose but for the NAT mechanism. That’s why the router is looking at that table. It also looks at the External PORT column and searches for a row that matches with the Dest PORT from the arriving packet, which is 888.

Internal IPInternal PORTExternal IPExternal PORT

We’re lucky that this row exists. If we weren’t so lucky, the packet would be dropped away. Now we need to understand, to what subnetwork node to send the packet. Don’t hurry, let’s remember how important ports are in this mechanism. Two nodes in the subnetwork could be sending requests to an external network. Then, if the router created port 888 for the first node, it created port 889 for the 2nd one. Let’s assume that that’s the case, and r1_nat looks like this:

Internal IPInternal PORTExternal IPExternal PORT

We can understand by port 888 that the needed internal address is The router changes that receiver’s address from




The packet successfully reaches node r1 and, upon looking at the packet content, the node finds out its external IP address – its address in the external network. It also knows the port that it makes way for through NAT.

What’s next? How is it useful? The usefulness lies within the entry to table r1_nat. If anyone sends a packet with port 888 to r1, the packet will be redirected to p1. Thus, a narrow way to a hidden node p1 is created.

From the example above you can imagine how NAT and STUN server work. Actually, ICE and STUN/TURN servers are there to bypass NAT restrictions.

Between node and a server, there can be several routers. In case the node receives the address of the router that is the first in the same network as the server. In other words, we’ll receive an address for the router that’s connected to the STUN server. It is exactly what we need for the p2p communication if we keep in mind the fact that each router will be updated with an important row in the NAT table. That’s why the way back will be as smooth as silk.

TURN server is an upgraded STUN server, therefore each TURN server can work as a STUN server. However, there are advantages to the TURN server. If a p2p communication is impossible (in 3g networks), the server becomes a relay and starts working as an intermediary. Of course, p2p is out of the question then, but outside of the ICE mechanism, nodes think that they have direct interaction.

When is the TURN server a must? Why is the STUN server not enough? Because there are different kinds of NAT. They substitute IP address and a port in the same manner, but some of them have embedded falsification protection. For example, in the symmetrical NAT table, two more parameters are stored, IP, and a remote node port. A packet from an external network goes through NAT to the internal network only when the address and port of the source match those mentioned in the table. That’s why the trick with the STUN server doesn’t work out NAT table stores the address and port of the STUN server. When the router receives a packet from a WebRTC interlocutor, it drops it off as it deems the packet falsified. The packet has arrived not from the STUN server.

Therefore, the TURN server is needed when the two interlocutors are behind a symmetric NAT (everyone’s behind their own)


Media stream

  • Video and audio data are packed into media streams
  • Media streams synchronize media tracks that they consist of
  • Different media streams aren’t synced between themselves
  • Media streams can be either local or remote. Local ones are in charge for camera and microphone, whereas remote ones receive data from the network as a code
  • There are two types of media tracks: for video and for audio
  • Media tracks can be turned on or off
  • Media tracks consist of media channels
  • Media tracks synchronize the media channels they consist of
  • Media streams and media tracks have marks that help to distinguish them from one another

Session descriptor

  • Session descriptor is used for a logical connection of two nodes within a network
  • Session descriptor stores information about available ways to code audio and video data
  • WebRTC uses an external signaling mechanism. Transferring session descriptors (SDP) becomes an application’s task
  • Mechanism of logical connection consists of two steps: offer and answer
  • Session descriptor generation is impossible without using a local media stream with an offer. It’s also impossible without using a remote session descriptor with an answer
  • A received descriptor must be given to WebRTC realization, regardless of whether this descriptor was received remotely or locally from the same WebRTC realization
  • There is also an opportunity to slightly change session descriptor


  • Ice candidate is a node’s address within a network
  • The address can be own, router’s or the TURN server’s
  • There are many candidates
  • A candidate consists of an IP address, port and a transport type (TCP or UDP)
  • Candidates are used to establish a physical connection between two nodes within a network
  • Candidates need to be sent via a signaling mechanism
  • Only remote candidates should be transferred to a WebRTC realization
  • In some WebRTC realizations, candidates can be sent only after a session descriptor is installed


  • NAT is a mechanism that allows access to an external network
  • Home routers support a special NAT table
  • Routers substitute addresses in packets. The source address becomes their own if the packet goes to an external network, and the source address becomes a node address within the internal network if the packet arrived from an external network.
  • NAT uses ports to allow multi-channel access to an external network
  • ICE is a mechanism to bypass NAT
  • STUN and TURN servers help to bypass NAT
  • STUN server allows the creation of obligatory entries in a NAT table and returns an external node address
  • TURN server generalizes STUN mechanism and makes it always working
  • In the worst-case scenarios, a TURN server is used as a relay, so a p2p turns into a client-server-client connection

WebRTC in iOS

You’ve probably heard of WebRTC if you wanted to create an online conference app or introduce calls to your application. There’s not much info on that technology, and even those little pieces that exist are developer-oriented. So we aren’t going to dive into the tech part but rather try to understand what WebRTC is.

WebRTC in brief

WebRTC (Web Real Time Communications) is a protocol that allows audio and video transmission in realtime. It works with both UDP and TCP and can switch between them. One of the main advantages of this protocol is that it can connect users via a p2p connection, thus it transmits directly while avoiding servers. However, to use p2p successfully, one must understand p2p’s and WebRTC’s peculiarities.

You can read more in-depth information about WebRTC here

Stun and Turn

Networks are usually designed with private IP addresses. These addresses are used within organizations for systems to be connected locally and they aren’t routed in the Internet. In order to allow a device with a private IP to contact devices and resources outside the local network, a private address must be translated to a publicly accessible address. NAT (Network Address Translation) takes care of the process. You can read more about NAT here. We just need to know that there’s a NAT table in the router and that we need a special record in the NAT which allows packets to our client. To create an entry in the NAT table, the client must send something to a remote client. The problem is that neither of the clients knows their external addresses. To deal with this, STUN and TURN servers were invented. You can connect two clients without TURN and STUN, but it’s only possible if the clients are within the same network.

STUN is directly connected to the Internet server. It receives a packet with an external address of the client that sent it and sends it back. The client learns its external address and the port that’s needed for the router to understand which client has sent the packet. That’s because several clients can simultaneously contact the external network from the internal one. That’s how the entry we need ends up in the NAT table.

TURN is an upgraded STUN server. It can work like STUN, but it also has more functions. For example, you will need TURN when NAT doesn’t allow packets sent by a remote client. It happens because there are different types of NAT, and some of them not only remember an external IP, but also a STUN server port. They don’t allow packets received from servers other than STUN. Not only that, it’s also impossible to establish a p2p connection inside 3G networks. In those cases you also need a TURN server which becomes a relay, making clients think that they’re connected via p2p.

Signal server

We know now why we need STUN and TURN servers, but it’s not the only thing about WebRTC. WebRTC can’t send data about connections, which means that we can’t connect clients using only WebRTC. We need to set up a way to transfer the data about connections (what is the data and why it’s needed, we’ll see below). And for that, we need a signal server. You can use any means for data transfer, you only need the opponents to exchange data among themselves. For instance, Fora Soft usually uses WebSockets (you can read about them here).

Video calls one-on-one

Although STUN, TURN, and signal servers have been discussed, it’s still unclear how to create a call. Let’s find out what steps we shall take to organize a video call.

Your iPhone can connect any device via WebRTC. It’s unnecessary for both clients to be related to iPhone, as you can also connect to Android devices or PCs.

We have two clients: a caller and one who’s being called. In order to make a call, a person has to: Receive their local media stream (a stream of video and audio data). Each stream can consist of several media channels. There can be a few media streams: from a camera and a desktop, for example. Media stream synchronizes media tracks, however, media streams can’t be synchronized between each other. Thus, sound and video from the camera will be synchronized with one another but not with the desktop video. Media channels inside the media track are synchronized, too. The code for the local media stream looks like this:

  • Create an offer, as in suggesting a call start.
  • Send their own SDP through the signal server. What is SDP? Devices have a multitude of parameters that need to be considered to establish a connection. For example, a set of codecs that work with the device. All these parameters are formed into an SDP object or a session descriptor that is later sent to an opponent via the signal server. It’s important to note that the local SDP is stored as text and can be edited before it’s sent to the signal server. It can be done to forcefully choose a codec. But it’s a rare occasion, and it doesn’t always work.
  • Send their Ice Candidate through a signal server. What’s Ice Candidate? SDP helps establish a logical connection, but the clients can’t find one another physically. Ice Candidate objects possess information about where the client is located in the network. Ice Candidate helps clients find each other and start exchanging media streams. It’s important to notice that that the local SDP is single, while there are many Ice Candidate objects. That happens because the client’s location within the network can be determined by an internal IP-address, TURN server addresses, as well as an external router address, and there can be several of them. Therefore, in order to determine the client’s location within the network, you need a few Ice Candidate objects.
  • Accept a remote media stream from the opponent and show it. With iOS, OpenGL or Metal can be used as tools for video stream rendering.

The opponent has to complete the same steps while you’re completing yours, except for the 2nd one. While you’re creating an offer, the opponent is proceeding with the answer, as in answers the call.

Actually, answer and offer are the same thing. The only difference is that while a person expecting a call sets up an answer means, while they generate their local SDP, they rely on the caller’s SDP object. To do that, they refer to the caller’s SDP object. Therefore, the clients know about both device parameters and can choose a more suitable codec.

To summarize: the clients first exchange SDPs (establish a logical connection), then Ice Candidates (establish a physical connection). Therefore, the clients connect successfully, they can see, hear, and talk with each other.

That’s not everything one needs to know when working with WebRTC in iOS. If we leave everything as it is at the moment, the app users will be able to talk. However, only if the application is open, will they be able to learn about an incoming call and answer it. The good thing is, this problem can be easily solved. iOS provides us with a VoIP push. It’s a kind of push notification in iOS, and it was created specifically for work with calls. This is how it’s registered:

This push notification helps show an incoming call screen which allows the user to accept or decline the call. It’s done via this function:

It doesn’t matter what the user is doing at the moment. They can be playing a game or having their phone screen blocked. VoIP push has the highest priority, which means that notifications will always be arriving, and the users will be able to easily call one another. VoIP push notifications have to be integrated along with call integration. It’s very difficult to use calls without VoIP because for a call to happen, the users will have to have their apps open and just sit and wait for the call. That can be classified as strange behavior. The users don’t want to act strange, so they’ll probably choose another application.


We’ve discussed some of the WebRTC peculiarities; found out what’s needed for two clients to connect; learned what steps the clients need to take for a call to happen; what to do besides WebRTC integration to allow iOS users to call one another. We hope that WebRTC isn’t a scary and unknown concept for you anymore, and you understand what you need to apply it to your product.