
How To Implement Screen Sharing in iOS App using ReplayKit and App Extension

Intro

Screen sharing is capturing a user’s display and showing it to their peers during a video call.

There are two ways to implement screen sharing in an iOS app:

  1. Screen sharing in app. A user can only share their screen from within that particular app; if they minimize the app window, broadcasting stops. It’s quite easy to implement.
  2. Screen sharing with extensions. This approach enables screen sharing from almost any point of the OS: e.g. the Home screen, external apps, system settings. But the implementation can be quite time-consuming.

In this article, we’ll share guides on both.

Screen sharing in app

Starting off easy – how to share the screen from within an app. We’ll use Apple’s ReplayKit framework.

import ReplayKit

class ScreenShareViewController: UIViewController {

    lazy var startScreenShareButton: UIButton = {
        let button = UIButton()
        button.setTitle("Start screen share", for: .normal)
        button.setTitleColor(.systemGreen, for: .normal)
        return button
    }()
    
    lazy var stopScreenShareButton: UIButton = {
        let button = UIButton()
        button.setTitle("Stop screen share", for: .normal)
        button.setTitleColor(.systemRed, for: .normal)
        return button
    }()

    lazy var changeBgColorButton: UIButton = {
        let button = UIButton()
        button.setTitle("Change background color", for: .normal)
        button.setTitleColor(.gray, for: .normal)
        return button
    }()
    
    lazy var videoImageView: UIImageView = {
        let imageView = UIImageView()
        imageView.image = UIImage(systemName: "rectangle.slash")
        imageView.contentMode = .scaleAspectFit
        return imageView
    }()
}

Here we add the screen-share buttons, the background color change button, and an imageView to the ViewController – this is where the captured video will appear later.

the ViewController

To capture the screen, we take the shared RPScreenRecorder instance via RPScreenRecorder.shared() and call startCapture(handler:completionHandler:).

@objc func startScreenShareButtonTapped() {
    RPScreenRecorder.shared().startCapture { sampleBuffer, sampleBufferType, error in
        self.handleSampleBuffer(sampleBuffer, sampleType: sampleBufferType)
        if let error = error {
            print(error.localizedDescription)
        }
    } completionHandler: { error in
        if let error = error {
            print(error.localizedDescription)
        }
    }
}

Then the app asks for permission to capture the screen:

the permission pop-up

ReplayKit starts generating a CMSampleBuffer stream for each media type – video, app audio, and mic audio. Each buffer contains the media fragment itself – e.g. a captured video frame – plus all the necessary metadata.

func handleSampleBuffer(_ sampleBuffer: CMSampleBuffer, sampleType: RPSampleBufferType) {
    switch sampleType {
    case .video:
        handleVideoFrame(sampleBuffer: sampleBuffer)
    case .audioApp:
        // handle app audio
        break
    case .audioMic:
        // handle mic audio
        break
    @unknown default:
        break
    }
}

The following function processes each generated frame: it converts the frame into a UIImage and displays it on the screen.

func handleVideoFrame(sampleBuffer: CMSampleBuffer) {
    guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    let ciImage = CIImage(cvPixelBuffer: imageBuffer)

    let context = CIContext(options: nil)
    guard let cgImage = context.createCGImage(ciImage, from: ciImage.extent) else { return }
    let image = UIImage(cgImage: cgImage)
    render(image: image)
}

Here’s what it looks like:

generated frames

Captured screen broadcasting in WebRTC 

A typical scenario: during a video call, one peer wants to show the other what’s happening on their screen. WebRTC is a great fit for this.

WebRTC connects two clients to deliver video data without any additional servers – it’s a peer-to-peer (p2p) connection. Check out this article to learn about it in detail.

The data streams that clients exchange are media streams containing audio and video tracks. A video track might carry a camera image or a screen image.

To establish a p2p connection successfully, configure a local media stream that will later be delivered to the session description. To do that, get an object of the RTCPeerConnectionFactory class and add to it a media stream packed with audio and video tracks.

func start(peerConnectionFactory: RTCPeerConnectionFactory) {

        self.peerConnectionFactory = peerConnectionFactory
        if self.localMediaStream != nil {
            self.startBroadcast()
        } else {
            let streamLabel = UUID().uuidString.replacingOccurrences(of: "-", with: "")
            self.localMediaStream = peerConnectionFactory.mediaStream(withStreamId: "\(streamLabel)")
            
            let audioTrack = peerConnectionFactory.audioTrack(withTrackId: "\(streamLabel)a0")
            self.localMediaStream?.addAudioTrack(audioTrack)

            self.videoSource = peerConnectionFactory.videoSource()
            self.screenVideoCapturer = RTCVideoCapturer(delegate: videoSource!)
            self.startBroadcast()
            
            self.localVideoTrack = peerConnectionFactory.videoTrack(with: videoSource!, trackId: "\(streamLabel)v0")
            if let videoTrack = self.localVideoTrack  {
                self.localMediaStream?.addVideoTrack(videoTrack)
            }
            self.configureScreenCapturerPreview()
        }
    }

Pay attention to the video track configuration:

func handleSampleBuffer(sampleBuffer: CMSampleBuffer, type: RPSampleBufferType) {
        if type == .video {
            guard let videoSource = videoSource,
                  let screenVideoCapturer = screenVideoCapturer,
                  let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
            
            let width = CVPixelBufferGetWidth(pixelBuffer)
            let height = CVPixelBufferGetHeight(pixelBuffer)
            videoSource.adaptOutputFormat(toWidth: Int32(width), height: Int32(height), fps: 24)
            
            let rtcpixelBuffer = RTCCVPixelBuffer(pixelBuffer: pixelBuffer)
            let timestamp = NSDate().timeIntervalSince1970 * 1000 * 1000
            
            let videoFrame = RTCVideoFrame(buffer: rtcpixelBuffer, rotation: RTCVideoRotation._0, timeStampNs: Int64(timestamp))
            videoSource.capturer(screenVideoCapturer, didCapture: videoFrame)
        }
}

Screen sharing with App Extension

Since iOS is a rather closed and highly protected OS, it’s not easy to access storage outside an app. To let developers reach certain features outside their app, Apple created App Extensions – external apps with access to certain integration points in iOS, each operating according to its type. App Extensions and the main app (let’s call it the Containing App) don’t interact with each other directly but can share a data storage container. To set that up, create an App Group on the Apple Developer website, then link the group to both the Containing App and the App Extension.

containing app and extension relation
Scheme of data exchange between entities
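
As a minimal sketch of what that shared container gives you, assuming a hypothetical App Group id group.com.example.screenshare (substitute the one you registered): both targets can read and write the same UserDefaults suite and file container.

import Foundation

// Hypothetical App Group id – substitute the one registered on the Apple Developer website.
let appGroupId = "group.com.example.screenshare"

// Shared key-value storage, visible to both the Containing App and the App Extension.
let sharedDefaults = UserDefaults(suiteName: appGroupId)
sharedDefaults?.set(true, forKey: "broadcastIsActive")

// Shared file container for larger data, e.g. serialized video frames.
let sharedFrameFileURL = FileManager.default
    .containerURL(forSecurityApplicationGroupIdentifier: appGroupId)?
    .appendingPathComponent("frameBuffer")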

Now let’s build the App Extension. Create a new Target and select Broadcast Upload Extension – it has access to the recorded stream and its further processing. Create and set up the App Group between the targets. You’ll now see a new folder with the App Extension; it contains Info.plist and the SampleHandler.swift file. SampleHandler declares a class of the same name, in which the recorded stream will be processed.

The methods we can operate with are already written in this class as well: 

override func broadcastStarted(withSetupInfo setupInfo: [String : NSObject]?)
override func broadcastPaused() 
override func broadcastResumed() 
override func broadcastFinished()
override func processSampleBuffer(_ sampleBuffer: CMSampleBuffer, with sampleBufferType: RPSampleBufferType)

Their names tell us what they’re responsible for – all except the last one. It is called with each new CMSampleBuffer and its type. If the buffer type is .video, that’s where the latest captured frame arrives.
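
As a rough sketch (ours for illustration, not the final implementation), the video branch of processSampleBuffer in SampleHandler could look like this – pull out the pixel buffer and hand it off for serialization, which we cover below:

override func processSampleBuffer(_ sampleBuffer: CMSampleBuffer, with sampleBufferType: RPSampleBufferType) {
    guard sampleBufferType == .video,
          let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    // pixelBuffer now holds the latest captured frame;
    // serialize it and write it into the shared App Group container (see below).
}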

Now let’s implement screen sharing by launching the iOS broadcast. To start off, we display the RPSystemBroadcastPickerView itself and set which extension it should call.

let frame = CGRect(x: 0, y: 0, width: 60, height: 60)
let systemBroadcastPicker = RPSystemBroadcastPickerView(frame: frame)
systemBroadcastPicker.autoresizingMask = [.flexibleTopMargin, .flexibleRightMargin]
if let url = Bundle.main.url(forResource: "<OurName>BroadcastExtension", withExtension: "appex", subdirectory: "PlugIns") {
    if let bundle = Bundle(url: url) {
        systemBroadcastPicker.preferredExtension = bundle.bundleIdentifier
    }
}
view.addSubview(systemBroadcastPicker)

Once a user taps “Start broadcast”, the broadcast starts and the selected extension processes the state and the stream itself. But how does the Containing App find out about this? Since the storage container is shared, we can exchange data through the file system – e.g. via UserDefaults(suiteName:) and FileManager. With them, we can set up a timer, check the state at regular intervals, and read and write data at a known path. An alternative is to launch a local web-socket server and talk to it, but in this article we’ll only cover exchange via files.

Let’s write a BroadcastStatusManagerImpl class that records the current broadcast status and notifies a subscriber about status changes. We’ll poll for updates with a timer that fires every 0.5 seconds.

protocol BroadcastStatusSubscriber: AnyObject {
    func onChange(status: Bool)
}

protocol BroadcastStatusManager: AnyObject {
    func start()
    func stop()
    func subscribe(_ subscriber: BroadcastStatusSubscriber)
}

final class BroadcastStatusManagerImpl: BroadcastStatusManager {

    // MARK: Private properties

    private let suiteName = "group.com.<YourOrganizationName>.<>"
    private let forKey = "broadcastIsActive"

    private weak var subscriber: BroadcastStatusSubscriber?
    private var isActiveTimer: DispatchTimer?
    private var isActive = false

    deinit {
        isActiveTimer = nil
    }

    // MARK: Public methods

    func start() {
        setStatus(true)
    }

    func stop() {
        setStatus(false)
    }

    func subscribe(_ subscriber: BroadcastStatusSubscriber) {
        self.subscriber = subscriber
        isActive = getStatus()

        isActiveTimer = DispatchTimer(timeout: 0.5, repeat: true, completion: { [weak self] in
            guard let self = self else { return }

            let newStatus = self.getStatus()

            guard self.isActive != newStatus else { return }

            self.isActive = newStatus
            self.subscriber?.onChange(status: newStatus)
        }, queue: DispatchQueue.main)

        isActiveTimer?.start()
    }

    // MARK: Private methods

    private func setStatus(_ status: Bool) {
        UserDefaults(suiteName: suiteName)?.set(status, forKey: forKey)
    }

    private func getStatus() -> Bool {
        UserDefaults(suiteName: suiteName)?.bool(forKey: forKey) ?? false
    }
}

Now we create instances of BroadcastStatusManagerImpl in both the App Extension and the Containing App, so that they know the broadcast state and can record it. The Containing App can’t stop the broadcast directly. That’s why we subscribe to the state: when it reports false, the App Extension terminates broadcasting using the finishBroadcastWithError method. Even though we actually end it with no error, this is the only method the Apple SDK provides for programmatic broadcast termination.

extension SampleHandler: BroadcastStatusSubscriber {
    func onChange(status: Bool) {
        if status == false {
            finishBroadcastWithError(NSError(domain: "<YourName>BroadcastExtension", code: 1, userInfo: [
                NSLocalizedDescriptionKey: "Broadcast completed"
            ]))
        }
    }
}

Now both apps know when the broadcast starts and ends. Next we need to deliver the data of the latest frame. To do that, we create a PixelBufferSerializer class that declares the serializing and deserializing methods. In SampleHandler’s processSampleBuffer method we convert the CMSampleBuffer to a CVPixelBuffer and then serialize it to Data. When serializing to Data, it’s important to record the format type, width, height, and bytes-per-row (stride) of each plane in the buffer. In this particular case there are two planes – luminance and chrominance – plus their data. To read the buffer’s contents, use the CVPixelBuffer* family of functions.
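
Here is a minimal sketch of what the serializing half might look like, assuming an NV12 (bi-planar YCbCr) pixel buffer; the header layout – format, width, height, then per-plane stride and bytes – and the function name are our own choices for illustration:

import CoreVideo
import Foundation

func serialize(pixelBuffer: CVPixelBuffer) -> Data {
    CVPixelBufferLockBaseAddress(pixelBuffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, .readOnly) }

    var data = Data()
    let format = CVPixelBufferGetPixelFormatType(pixelBuffer)
    let width = UInt32(CVPixelBufferGetWidth(pixelBuffer))
    let height = UInt32(CVPixelBufferGetHeight(pixelBuffer))
    withUnsafeBytes(of: format) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: width) { data.append(contentsOf: $0) }
    withUnsafeBytes(of: height) { data.append(contentsOf: $0) }

    // Two planes for NV12: plane 0 – luminance (Y), plane 1 – chrominance (CbCr).
    for plane in 0 ..< CVPixelBufferGetPlaneCount(pixelBuffer) {
        let bytesPerRow = UInt32(CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, plane))
        let planeHeight = CVPixelBufferGetHeightOfPlane(pixelBuffer, plane)
        withUnsafeBytes(of: bytesPerRow) { data.append(contentsOf: $0) }
        if let base = CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, plane) {
            data.append(Data(bytes: base, count: Int(bytesPerRow) * planeHeight))
        }
    }
    return data
}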

While testing iOS-to-Android, we ran into a problem: the Android device just wouldn’t display the shared screen. The reason was that Android didn’t support the non-standard resolution of the video. We solved it by simply scaling the video to 1080×720.

Once the frame has been serialized into Data, write its bytes into the shared memory-mapped file:

memcpy(mappedFile.memory, baseAddress, data.count)

Then create the BroadcastBufferContext class in the Containing App. Its logic is similar to BroadcastStatusManagerImpl: on each timer iteration it reads the file and reports the data for further processing. The stream itself arrives at 60 FPS, but it’s better to read it at 30 FPS, since the system doesn’t have enough resources to keep up with processing at 60 FPS.

func subscribe(_ subscriber: BroadcastBufferContextSubscriber) {
        self.subscriber = subscriber

        framePollTimer = DispatchTimer(timeout: 1.0 / 30.0, repeat: true, completion: { [weak self] in
            guard let mappedFile = self?.mappedFile else {
                return
            }

            var orientationValue: Int32 = 0
            mappedFile.read(at: 0 ..< 4, to: &orientationValue)
            self?.subscriber?.newFrame(Data(
                bytesNoCopy: mappedFile.memory.advanced(by: 4),
                count: mappedFile.size - 4,
                deallocator: .none
            ))
        }, queue: DispatchQueue.main)
        framePollTimer?.start()
    }

Deserialize it all back into a CVPixelBuffer, the same way we serialized it but in reverse (a sketch of the reverse operation is given after the track setup below). Then configure the video track by setting its resolution and FPS.

videoSource.adaptOutputFormat(toWidth: Int32(width), height: Int32(height), fps: 60)

Now add the frame via RTCVideoFrame(buffer: rtcpixelBuffer, rotation: RTCVideoRotation._0, timeStampNs: Int64(timestamp)). The track goes to the local stream:

localMediaStream.addVideoTrack(videoTrack)
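
For completeness, here is a sketch of the reverse operation mentioned above – rebuilding a CVPixelBuffer from Data written with the layout used in the serialization sketch earlier. Again, this is our own illustration rather than the exact PixelBufferSerializer implementation:

import CoreVideo
import Foundation

func deserializePixelBuffer(from data: Data) -> CVPixelBuffer? {
    var offset = 0
    // Reads the next 4-byte header field (format, width, height and strides are stored as UInt32).
    func readUInt32() -> UInt32 {
        var value: UInt32 = 0
        _ = withUnsafeMutableBytes(of: &value) { data.copyBytes(to: $0, from: offset ..< offset + 4) }
        offset += 4
        return value
    }

    let format = readUInt32()      // OSType, e.g. kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
    let width = Int(readUInt32())
    let height = Int(readUInt32())

    var pixelBuffer: CVPixelBuffer?
    guard CVPixelBufferCreate(kCFAllocatorDefault, width, height, format, nil, &pixelBuffer) == kCVReturnSuccess,
          let buffer = pixelBuffer else { return nil }

    CVPixelBufferLockBaseAddress(buffer, [])
    defer { CVPixelBufferUnlockBaseAddress(buffer, []) }

    for plane in 0 ..< CVPixelBufferGetPlaneCount(buffer) {
        let srcBytesPerRow = Int(readUInt32())
        let dstBytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(buffer, plane)
        let planeHeight = CVPixelBufferGetHeightOfPlane(buffer, plane)
        guard let dst = CVPixelBufferGetBaseAddressOfPlane(buffer, plane) else { return nil }

        data.withUnsafeBytes { raw in
            guard let base = raw.baseAddress else { return }
            let src = base.advanced(by: offset)
            // Copy row by row: source and destination strides may differ.
            for row in 0 ..< planeHeight {
                memcpy(dst.advanced(by: row * dstBytesPerRow),
                       src.advanced(by: row * srcBytesPerRow),
                       min(srcBytesPerRow, dstBytesPerRow))
            }
        }
        offset += srcBytesPerRow * planeHeight
    }
    return buffer
}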

Conclusion 

Implementing screen sharing in iOS is not as easy as it may seem. The closed nature and security of the OS force developers to look for workarounds for tasks like this. We’ve found some – check out the result in our Fora Soft Video Calls app. Download it on the App Store.


How to Choose a Live Streaming Protocol? RTMP vs HLS vs WebRTC


When streaming video, delivery mostly depends on which live streaming protocol is used. Each protocol performs differently in terms of latency, scalability, and supported devices. While there’s no universal solution for every live streaming need, you’re still free to pick the best match for your specific requirements.

How does streaming work?

For the end user, live streaming is all about a near-real-time experience, with a screen and a camera standing between them and the source. In reality, there’s much more going on backstage.

It does start with capturing whatever is about to go live on camera. It does end with playing the content on an end-user device. But to make this happen, it takes four more milestones to complete.

  1. Once even a byte of information has been recorded, video and audio get compressed with an encoder. In short, an encoder converts raw media material into a piece of digital information that can be displayed on many devices. It also reduces the file size from GBs to MBs to ensure quick data delivery and playback.
  2. Compressed data then gets ingested into a streaming platform for further processing. 
  3. The platform resource transcodes the stream and packages it into its final format and protocol. The latter is basically the way data travels from one communicating system to another. 
  4. To get to its end user, the stream is then delivered across the internet, in most cases via a content delivery network (CDN). In turn, CDN is a system of servers located in different physical locations.

Aaand — cut! Here’s where a user sees the livestream on their device. A long way to go, huh? In fact, this journey may take anywhere from less than 0.5 seconds to 1–2 minutes. This is called latency, and it varies from one protocol to another – as do many other parameters.

There are quite a few live streaming protocols to choose from, but we’ll elaborate on the three most commonly used: RTMP, HLS, and WebRTC. In short, here’s the difference:

protocols comparison
Live Streaming protocols (RTMP, HLS, WebRTC) comparison

Now in detail.

RTMP

Real-Time Messaging Protocol (RTMP) is probably the most widely supported media streaming protocol from both sides of the process. It’s compatible with many professional recording devices and also quite easy to ingest into streaming platforms like YouTube, Twitch, and others. 

Supporting low-latency streaming, RTMP delivers data at roughly the same pace as cable broadcasting — it takes only 5 seconds to transmit information. This is due to using a firehose approach that builds a steady stream of available data and delivers it to numerous users in real time. It just keeps flowing!

Yet, this video streaming protocol is no longer supported by browsers. As a result, the stream has to be additionally converted and transcoded into an HTTP-based technology, which prolongs the overall latency to 6–30 seconds.

HLS

HTTP Live Streaming, developed by Apple as a part of QuickTime, Safari, OS X, and iOS software, is a perfect pick if you stream to a large audience, even millions at a time. HLS is supported by all browsers and almost any device (set-top boxes, too). 

The protocol supports adaptive bitrate streaming which provides the best video quality no matter the connection, software, or device. Basically it’s key to the best viewer experience.

Usually, to cover a large audience with broad geography stably and with the lowest latency, CDN is used. 

The only major drawback of HLS seems to be the latency — prioritizing quality, it may reach 45 seconds when used end-to-end. 

Apple presented the solution in 2019: Low Latency HLS shrinks the delay to less than 2 seconds and is currently supported by all browsers, as well as Android, Linux, Microsoft and, obviously, macOS devices, several set-top boxes and Smart TVs. 

WebRTC

RTC in WebRTC stands for “real-time communication”, which suggests this protocol is perfect for an interactive video environment. With the minimum latency possible (less than 0.5 seconds), WebRTC is mostly used for video conferencing with a small number of participants (usually 6 or fewer). We use it a lot in our projects for video conferencing – check out the ProVideoMeeting or CirrusMED portfolios.

But media servers like Kurento allow WebRTC scaling to an audience of up to 1 million viewers. Yet, a more popular approach would be to use WebRTC as an ingest and repackage it into more scalable HLS. 

Apart from providing the absolute minimum latency possible, WebRTC doesn’t require any additional equipment for a streamer to do the recording. 

Conclusion

Thus, when choosing a live streaming protocol for your streaming goals, ask yourself these questions:

  • What kind of infrastructure am I willing to establish, or do I already have? What recording devices and software are at my disposal?
  • What kind of content do I want to deliver, and what latency is acceptable? How “live” do I want the streaming to be?
  • What system scalability do I expect? How many users do I estimate will watch the stream at a time?

One protocol may not satisfy all your needs. But a good mix of two can. Hit us up to get a custom solution to turn your ideas into action.


Video Conferencing System Architecture: P2P vs MCU vs SFU?

Even though WebRTC is a protocol developed to do one job, to establish low-ping high-security multimedia connections, one of its best features is flexibility. Even if you are up for a pretty articulated task of creating a video conferencing app, the options are open.

You can go full p2p, deploy a media server backend (there is quite a variety of those), or combine these approaches to your liking. You can pick desired features and find a handful of ways to implement them. Finally, you can freely choose between a solid backend or a scalable media server grid created by one of the many patterns. 

With all of these freedoms at your disposal, picking the best option might – and will – be tricky. Let us clear up the P2P vs MCU vs SFU fight a bit for you.

What is P2P?

Let’s imagine it’s Christmas. Or any other mass gift giving opportunity of your choice. You’ve got quite a bunch of friends scattered across town, but everyone is too busy to throw a gift exchange party. So, you and your besties agree everyone will get their presents once they drop by each other.

When you want to exchange something with your peers, no matter whether it’s a Christmas gift or a live video feed, the obvious way is definitely peer to peer. Each of your friends comes to knock on your door and get their box, and then you visit everyone in return. Plain and simple.

For a WebRTC chat, that means all call parties are directly connected to each other, with the host only serving as a meeting point (or an address book, in our Christmas example).

This pattern works great, while:

  • your group is rather small
  • everyone is physically able to reach each other

Every gift from our example requires some time and effort to be dealt with: at least, you have to drive to a location (or to wait for someone to come to you), open the door, give the box and say merry Christmas.

  • If there are 4 members in a group, each of you needs time to handle 6 gifts – 3 to give, 3 to take.
  • When there are 5 of you, 8 gifts per person are to be taken care of.
  • Once your group increases to 6 members, your Christmas to-do list now features 10 gifts.

At one point, there will be too many balls in the air: the amount of incoming and outgoing gifts will be too massive to handle comfortably.

Same for video calls: every single P2P stream has to be encoded and sent, or decoded and displayed, in real time – each operation requiring a fraction of your system’s performance, network bandwidth, and battery capacity. That fraction can be quite significant for higher-quality video: while a 2-on-2 or even a 5-on-5 conference will work decently on any relatively up-to-date device, a 10-on-10 peer-to-peer FullHD call would eat up give or take 50 Mbps in bandwidth and put quite a load even on a mid-to-high tier CPU.

p2p architecture
Peer-to-peer architecture 
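
Here is the rough arithmetic behind that “give or take 50 Mbps” figure, assuming roughly 2.5–3 Mbps per FullHD stream (our own ballpark numbers):

// Ballpark bandwidth estimate for an n-user peer-to-peer FullHD call.
let participants = 10
let streamsPerClient = 2 * (participants - 1)   // 9 outgoing + 9 incoming streams
let mbpsPerStream = 2.8                         // assumed FullHD bitrate, Mbps
let totalMbps = Double(streamsPerClient) * mbpsPerStream
print(totalMbps)                                // ≈ 50 Mbps per client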

Now regarding the physical ability to reach. Imagine one of your friends has recently moved to an upscale gated community. They are free to drive in and out – so, they’ll get to your front door for their presents, but your chances to reach their home for your gift are scarce.

WebRTC-wise, we are talking about corporate networks with NATs and/or VPNs. You can reach most hosts from inside the network while being as good as unreachable from outside. In any case, your peers might be unable to see you, or vice versa, or both.

And finally – if all of you decide to pile up the gifts for a fancy Instagram photo, everyone will have to create both the box heap and the picture themselves: the presents are at the recipients’ homes.

WebRTC: peer-to-peer means no server side recording (or any other once-per-call features). At all.

Peer-to-Peer applications examples

Google Meet and secure mobile video calling apps without any server-side features like video recording.

That’s where media servers come to save the day.

WebRTC media servers: MCU and SFU

Back to the imaginary Christmas. Your bunch of friends is huge, so you figure out you’ll spend the whole holiday season waiting for someone or driving somewhere. To save your time, you pay your local coffee shop to serve as a gift distribution node. 

From now on, everyone in your group needs to reach a single location to leave or get gifts – the coffee shop.

That’s how the WebRTC media servers work. They accept calling parties’ multimedia streams and deliver them to everyone in a conference room. 

A while ago, WebRTC media servers used to come in two flavors: SFU (Selective Forwarding Unit) and MCU (Multipoint Conferencing Unit / Multipoint Control Unit). As of today, most commercial and open-source solutions offer both SFU and MCU features, so both terms now describe features and usage patterns rather than product types.

What are those?

SFU / Selective Forwarding Unit

What is an SFU?

SFU sends separate video streams of everyone to everyone.

The bartender at the coffee shop keeps track of all the gifts arriving at the place, and calls their recipients if there’s something new waiting for them. Once you receive such a call, you drop by the shop, have a ‘chino, get your box and head back home.

The bad news is: the bartender gives you calls about one gift at a time. So, if there are three new presents, you’ll have to hit the road three times in a row. If there are twenty… you probably got the point. Alternatively, you can visit the place at a certain periodicity, checking for new arrivals yourself.

Also, as your gift marathon flourishes, coffee quality degrades: the more people join in, the more time and effort the bartender dedicates to distributing gifts instead of caffeine. Remember: one gift – one call from the shop.

Media Server working as a Selective Forwarding Unit allows call participants to send their video streams once only – to the server itself. The backend will clone this stream and deliver it to every party involved in a call.

With SFU, every client consumes almost two times less bandwidth, CPU capacity, and battery power than it would in a peer-to-peer call:

  • for a 4-user call: 1 outgoing stream, 3 incoming (instead of 3 in, 3 out for p2p)
  • for a 5-user call: 1 outgoing stream, 4 incoming (would be 4 and 4 in p2p)
  • for a 10-user call: 1 out, 9 in (9 in, 9 out – p2p)
sfu architecture
SFU architecture

The drawback kicks in as the number of users per call approaches 20. “Selective” in SFU stands for the fact that this unit doesn’t forward media in bulk – it delivers media on a per-request basis. And since WebRTC is always a p2p protocol, even when a server is involved, every concurrent stream is a separate connection. So, for a 10-user video meetup a server has to maintain 10 ingest (“video receiving”) and 90 outgoing connections, each requiring computing power, bandwidth and, ultimately, money. But…

SFU Scalability

Once the coffee shop owner grows angry with the gift exchange intensity, you can take the next step, and pay some more shops in the neighborhood to join in. 

Depending on a particular shop’s load, some of the gift givers or receivers can be routed to another, less crowded one. The grid might grow almost infinitely, since every shop can either forward a package to their addressee, or to an alternative pick up location.

Forwarding rules are perfectly flexible. Like, Johnson’s coffee keeps gifts for your friends with first names starting A – F, and Smartducks is dedicated to parcels for downtown residents, while Randy’s Cappuccino will forward your merry Christmas to anyone who sent their own first gift since last Thursday.

The one stream – one connection approach provided by the SFU pattern has a feature that beats almost all of its cons. That feature is scalability.

Just like you forward a user’s stream to another participant, you can forward it to another server. With this in mind, the back end WebRTC architecture can be adjusted to grow and shrink depending on the number of users, conferences and traffic intensity.

E.g., if too many clients request a particular stream from one host, you can spawn a new one, clone the stream there and distribute it from a new unoccupied location.

Or, in case if you expect rush entrance on a massive conference (e.g., some 20-30 streaming users, and hundreds of view-only subscribers), you can assign two separate media server groups, one to handle incoming streams and the other to deliver it to subscribers. In this case, any load spikes on the viewing side will have zero effect on video ingest, and vice versa.

SFU applications examples

Skype and almost every other mobile messenger with video conferencing and call recording capabilities employ the SFU pattern on the backend.

Receiving other users’ video as separate streams provides capabilities for adaptive UX, allows per-stream quality adjustment and improves overall call stability in a volatile environment of a cellular network.

MCU / Multipoint Conferencing Unit

What is an MCU?

MCU unites all streams into 1 and sends just 1 stream to each participant.

Giving gifts is making friends, right? Now almost everyone in town is your buddy and participates in this gift exchange. The coffee shop hosting the exchange comes up with a great idea: why don’t we put all the presents for a particular person in a huge crate with their name on it. Moreover, some holiday magic is now involved: once there are new gifts for anyone, they appear in their respective boxes on their own.

Still, making Christmas magic seems to be harder work than making coffee: they might even need to hire more people to cast spells on the gift crates. And even with extra wizards on duty, there is zero chance you can rearrange the crates’ content order for your significant other to see your gift first – everyone gets the same pattern.

MCU architecture

Well, some of the MCU-related features do really ask for puns over an acronym shared with Marvel Cinematic Universe. Something marvelous is definitely involved. Media server in an MCU role has to keep only 20 connections for a 10 user conference, instead of 100 links of an SFU – one ingest, one output per user. How come? It merges all the videos and audios a user needs to receive into a single stream, and delivers it to a particular client. That’s how Zoom’s conferences are made: with MCU, even a lower-tier computer is capable of handling a 100-user live call.
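
To put the connection arithmetic from this section and the SFU section side by side, here is a small illustration (the formulas simply restate the counts given above):

// Per-call connection counts for an n-user conference.
func connectionCounts(users n: Int) -> (p2pPerClient: Int, sfuOnServer: Int, mcuOnServer: Int) {
    let p2pPerClient = 2 * (n - 1)        // n-1 outgoing + n-1 incoming streams on every client
    let sfuOnServer = n + n * (n - 1)     // n ingest connections + n*(n-1) outgoing ones
    let mcuOnServer = 2 * n               // 1 ingest + 1 composited output per user
    return (p2pPerClient, sfuOnServer, mcuOnServer)
}
print(connectionCounts(users: 10))        // (18, 100, 20)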

Magic obviously comes at a price, though. Compositing multiple video and audio streams in real time is *much* more of a performance guzzler than any forwarding pattern. Even more, if you have to somehow exclude one’s own voice and picture from the merged grid they receive – for each of the users. 

Another drawback, though mitigable, is that a composited grid is the same for everyone who receives the video – no matter what is their screen resolution, or aspect ratio, or whatever. If there is a need to have different layouts for mobile and desktop devices – you’ll have to composite the video twice.

MCU scalability

In WebRTC video call, compared to SFU, MCU pattern has considerably less scaling potential: video compositing with sub-second delays does not allow on-the-fly load redistribution for a particular conference. Still, one can autospawn additional server instances for new calls in a virtualized environment, or, for even better efficiency, assign an additional SFU unit to redistribute composited video.

MCU applications examples

Zoom and a majority of its alternatives for massive video conferencing run off MCU-like backends. Otherwise, WebRTC video calls for 25+ participants would only be available for high-end devices.

TL;DR: what and when do I use?

Quick comparison: P2P, SFU, MCU or MCU + SFU

~1-4 users per call – P2P

Pros:

  • lowest idling costs
  • easiest scaling
  • shortest TTM (time to market)
  • potentially the most secure

Cons:

  • for 5+ user calls – quality might deteriorate on weaker devices
  • highest bandwidth usage (may be critical for mobile users)
  • no server side recording, video analytics or other advanced features

Applications:

  • private / group calls
  • video assistance and sales

5-20 users per call – SFU

Pros:

  • easily scalable with simultaneous call number growth
  • retains UX flexibility while providing server side features
  • can have node redundancy by design: thus, most rush-proof

Cons:

  • pretty traffic- and performance intensive on the client side
  • might still require a compositing MCU-like service to record calls 

Use cases:

  • E-learning: workshops and virtual classrooms
  • Corporate communications: meeting and pressrooms

20+ users per call – MCU / MCU + SFU

Pros:

  • least load on client side devices
  • capable of serving the biggest audiences
  • easily recordable (server side / client side)

Cons:

  • biggest idling and running costs
  • one call capacity is limited to performance of a particular server
  • least customizable layout

Use cases:

  • Large event streaming
  • Social networking
  • Online media 

Conclusion

P2P, MCU, and SFU are the main architectural patterns for building on WebRTC. You can read more about WebRTC on our blog:

How to minimize latency to less than 1 sec for mass streams?
WebRTC in Android.
WebRTC security in plain language for business people.

Got another question not covered here? Feel free to contact us using this form, and our professionals will be happy to help you with everything.


Creating a WebRTC connection on Android. Guide For Beginners


Briefly about WebRTC
2 ways to implement video communication with WebRTC on Android
Creating a connection
Logical connection
Physical connection
Stun server
Turn server
Receiving data
Sending data
Many-to-many
Conclusion

Briefly about WebRTC

WebRTC is a video chat and conferencing development technology. It allows you to create a peer-to-peer connection between mobile devices and browsers to transmit media streams. You can find more details on how it works and its general principles in our article about WebRTC in plain language.

2 ways to implement video communication with WebRTC on Android

  • The easiest and fastest option is to use one of the many commercial projects, such as Twilio or LiveSwitch. They provide their own SDKs for various platforms and implement functionality out of the box, but they have drawbacks: they are paid, and the functionality is limited – you can only build the features they offer, not anything you can think of.
  • Another option is to use one of the existing libraries. This approach requires more code but will save you money and give you more flexibility in implementing functionality. In this article, we will look at the second option and use https://webrtc.github.io/webrtc-org/native-code/android/ as our library.

Creating a connection

Two steps in creating a WebRTC connection

Creating a WebRTC connection consists of two steps: 

  1. Establishing a logical connection – devices must agree on the data format, codecs, etc.
  2. Establishing a physical connection – devices must know each other’s addresses

To begin with, note that a signaling mechanism is used to exchange data between devices while the connection is being initiated. The signaling mechanism can be any data transmission channel, such as sockets.

Suppose we want to establish a video connection between two devices. To do this we need to establish a logical connection between them.

A logical connection

A logical connection is established using the Session Description Protocol (SDP). For this, one peer:

  1. Creates a PeerConnection object.
  2. Forms an SDP offer object, which contains data about the upcoming session, and sends it to the interlocutor using the signaling mechanism.

val peerConnectionFactory: PeerConnectionFactory
lateinit var peerConnection: PeerConnection

fun createPeerConnection(iceServers: List<PeerConnection.IceServer>) {
  val rtcConfig = PeerConnection.RTCConfiguration(iceServers)
  peerConnection = peerConnectionFactory.createPeerConnection(
      rtcConfig,
      object : PeerConnection.Observer {
          ...
      }
  )!!
}

fun sendSdpOffer() {
  peerConnection.createOffer(
      object : SdpObserver {
          override fun onCreateSuccess(sdpOffer: SessionDescription) {
              peerConnection.setLocalDescription(sdpObserver, sdpOffer)
              signaling.sendSdpOffer(sdpOffer)
          }

          ...

      }, MediaConstraints()
  )
}

In turn, the other peer:

  1. Also creates a PeerConnection object.
  2. Using the signaling mechanism, receives the SDP offer sent by the first peer and stores it
  3. Forms an SDP answer and sends it back, also using the signaling mechanism

fun onSdpOfferReceive(sdpOffer: SessionDescription) { // Saving the received SDP offer
  peerConnection.setRemoteDescription(sdpObserver, sdpOffer)
  sendSdpAnswer()
}

// Forming and sending the SDP answer
fun sendSdpAnswer() {
  peerConnection.createAnswer(
      object : SdpObserver {
          override fun onCreateSuccess(sdpAnswer: SessionDescription) {
              peerConnection.setLocalDescription(sdpObserver, sdpAnswer)
              signaling.sendSdpAnswer(sdpAnswer)
          }
           …
      }, MediaConstraints()
  )
}

The first peer, having received the SDP answer, saves it:

fun onSdpAnswerReceive(sdpAnswer: SessionDescription) {
  peerConnection.setRemoteDescription(sdpObserver, sdpAnswer)
}

After successful exchange of SessionDescription objects, the logical connection is considered established. 

Physical connection 

We now need to establish the physical connection between the devices, which is most often a non-trivial task. Typically, devices on the Internet do not have public addresses, since they are located behind routers and firewalls. To solve this problem WebRTC uses ICE (Interactive Connectivity Establishment) technology.

STUN and TURN servers are an important part of ICE. They serve one purpose – to establish connections between devices that do not have public addresses.

STUN server

A device makes a request to a STUN server and receives its public address in response. Then, using the signaling mechanism, it sends that address to the interlocutor. After the interlocutor does the same, the devices know each other’s network locations and are ready to transmit data to each other.

TURN server

In some cases, the router may have a “Symmetric NAT” restriction that won’t allow a direct connection between the devices. In that case, a TURN server is used: it acts as an intermediary, and all data goes through it. Read more in Mozilla’s WebRTC documentation.

As we have seen, STUN and TURN servers play an important role in establishing a physical connection between devices. It is for this purpose that, when creating the PeerConnection object, we pass in a list of available ICE servers.

To establish a physical connection, one peer generates ICE candidates – objects containing information about how the device can be found on the network – and sends them to the other peer via the signaling mechanism:

lateinit var peerConnection: PeerConnection

fun createPeerConnection(iceServers: List<PeerConnection.IceServer>) {

  val rtcConfig = PeerConnection.RTCConfiguration(iceServers)

  peerConnection = peerConnectionFactory.createPeerConnection(
      rtcConfig,
      object : PeerConnection.Observer {
          override fun onIceCandidate(iceCandidate: IceCandidate) {
              signaling.sendIceCandidate(iceCandidate)
          }
          …
      }
  )!!
}

Then the second peer receives the first peer’s ICE candidates via the signaling mechanism and saves them. It also generates its own ICE candidates and sends them back:

fun onIceCandidateReceive(iceCandidate: IceCandidate) {
  peerConnection.addIceCandidate(iceCandidate)
}

Now that the peers have exchanged their addresses, you can start transmitting and receiving data.

Receiving data

After the logical and physical connections with the interlocutor are established, the library calls the onAddTrack callback and passes into it the MediaStream object containing the interlocutor’s VideoTrack and AudioTrack:

fun createPeerConnection(iceServers: List<PeerConnection.IceServer>) {

   val rtcConfig = PeerConnection.RTCConfiguration(iceServers)

   peerConnection = peerConnectionFactory.createPeerConnection(
       rtcConfig,
       object : PeerConnection.Observer {

           override fun onIceCandidate(iceCandidate: IceCandidate) { … }

           override fun onAddTrack(
               rtpReceiver: RtpReceiver?,
               mediaStreams: Array<out MediaStream>
           ) {
               onTrackAdded(mediaStreams)
           }
           … 
       }
   )!!
}

Next, we must retrieve the VideoTrack from the MediaStream and display it on the screen. 

private fun onTrackAdded(mediaStreams: Array<out MediaStream>) {
   val videoTrack: VideoTrack? = mediaStreams.mapNotNull {                                                            
       it.videoTracks.firstOrNull() 
   }.firstOrNull()

   displayVideoTrack(videoTrack)

   … 
}

To display a VideoTrack, you need to pass it an object that implements the VideoSink interface. For this purpose, the library provides the SurfaceViewRenderer class.

fun displayVideoTrack(videoTrack: VideoTrack?) {
   videoTrack?.addSink(binding.surfaceViewRenderer)
}

To hear the interlocutor we don’t need to do anything extra – the library does everything for us. Still, if we want to fine-tune the sound, we can get the AudioTrack object and use it to change the audio settings:

var audioTrack: AudioTrack? = null
private fun onTrackAdded(mediaStreams: Array<out MediaStream>) {
   … 

   audioTrack = mediaStreams.mapNotNull { 
       it.audioTracks.firstOrNull() 
   }.firstOrNull()
}

For example, we could mute the interlocutor, like this:

fun muteAudioTrack() {
   audioTrack?.setEnabled(false)
}

Sending data

Sending video and audio from your device also begins with creating a PeerConnection object and sending ICE candidates. But unlike creating an SDP offer when receiving a video stream from the interlocutor, in this case we must first create a MediaStream object that includes an AudioTrack and a VideoTrack.

To send our audio and video streams, we need to create a PeerConnection object, and then use a signaling mechanism to exchange IceCandidate and SDP packets. But instead of getting the media stream from the library, we must get the media stream from our device and pass it to the library so that it will pass it to our interlocutor.

fun createLocalConnection() {

   localPeerConnection = peerConnectionFactory.createPeerConnection(
       rtcConfig,
       object : PeerConnection.Observer {
            ...
       }
   )!!

   val localMediaStream = getLocalMediaStream()
   localPeerConnection.addStream(localMediaStream)

   localPeerConnection.createOffer(
       object : SdpObserver {
            ...
       }, MediaConstraints()
   )
}

Now we need to create a MediaStream object and pass the AudioTrack and VideoTrack objects into it

val context: Context
private fun getLocalMediaStream(): MediaStream? {
   val stream = peerConnectionFactory.createLocalMediaStream("user")

   val audioTrack = getLocalAudioTrack()
   stream.addTrack(audioTrack)

   val videoTrack = getLocalVideoTrack(context)
   stream.addTrack(videoTrack)

   return stream
}

Receive audio track:

private fun getLocalAudioTrack(): AudioTrack {
   val audioConstraints = MediaConstraints()
   val audioSource = peerConnectionFactory.createAudioSource(audioConstraints)
   return peerConnectionFactory.createAudioTrack("user_audio", audioSource)
}

Receiving a VideoTrack is a tiny bit more difficult. First, get a list of all the device’s cameras:

lateinit var capturer: CameraVideoCapturer

private fun getLocalVideoTrack(context: Context): VideoTrack {
   val cameraEnumerator = Camera2Enumerator(context)
   val camera = cameraEnumerator.deviceNames.firstOrNull {
       cameraEnumerator.isFrontFacing(it)
   } ?: cameraEnumerator.deviceNames.first()
   
   ...

}

Next, create a CameraVideoCapturer object, which will capture the image

private fun getLocalVideoTrack(context: Context): VideoTrack {

   ...


   capturer = cameraEnumerator.createCapturer(camera, null)
   val surfaceTextureHelper = SurfaceTextureHelper.create(
       "CaptureThread",
       EglBase.create().eglBaseContext
   )
   val videoSource =
       peerConnectionFactory.createVideoSource(capturer.isScreencast)
   capturer.initialize(surfaceTextureHelper, context, videoSource.capturerObserver)

   ...

}

Now, after getting CameraVideoCapturer, start capturing the image and add it to the MediaStream

private fun getLocalMediaStream(): MediaStream? {
  ...

  val videoTrack = getLocalVideoTrack(context)
  stream.addTrack(videoTrack)

  return stream
}

private fun getLocalVideoTrack(context: Context): VideoTrack {
    ...

  capturer.startCapture(1024, 720, 30)

  return peerConnectionFactory.createVideoTrack("user0_video", videoSource)

}

After creating a MediaStream and adding it to the PeerConnection, the library forms an SDP offer, and the SDP packet exchange described above takes place through the signaling mechanism. When this process is complete, the interlocutor will begin to receive our video stream. Congratulations, at this point the connection is established.

Many to Many

We have considered a one-to-one connection. WebRTC also allows you to create many-to-many connections. In its simplest form, this is done in exactly the same way as a one-to-one connection; the difference is that a PeerConnection object is created, and the SDP and ICE candidate exchange is performed, not once but for each participant. This approach has disadvantages:

  • The device is heavily loaded because it needs to send the same data stream to each interlocutor
  • The implementation of additional features such as video recording, transcoding, etc. is difficult or even impossible

In this case, WebRTC can be used in conjunction with a media server that takes care of the above tasks. For the client-side the process is exactly the same as for direct connection to the interlocutors’ devices, but the media stream is not sent to all participants, but only to the media server. The media server retransmits it to the other participants.

Conclusion

We have considered the simplest way to create a WebRTC connection on Android. If after reading this you still don’t understand it, just go through all the steps again and try to implement them yourself – once you have grasped the key points, using this technology in practice will not be a problem. 

And there you have it – your video chat! Android also allows you to create a custom call notification. After reading our guide on it, even those who are really new to coding will be able to do it!

Not an Android guy? We’ve got you covered with our WebRTC on iOS guide.

You can also refer to the following resources for a better understanding of WebRTC:

WebRTC documentation by Mozilla

Fora Soft article on WebRTC security


WebRTC Security in Plain Language for Business People


Let’s say you are a businessman and you want to develop a video conference app or add a video chat to your program. How do you know that what the developer has done is safe? What kind of protection can you promise your users? There are a lot of articles, but they are technical – it’s hard to figure out the specifics of security. Let’s explain in simple words.

WebRTC security measures consist of 3 parts: those offered by WebRTC, those provided by the browser, and those programmed by the developer. Let’s discuss the measures of each kind, how they are circumvented – the WebRTC security vulnerabilities – and how to protect against them.

What is WebRTC?

WebRTC – Web Real-Time Communications – is an open standard that describes the transmission of streaming audio, video, and content between browsers or other supporting applications in real-time.

WebRTC is an open-source project, so anyone can do WebRTC code security testing, like here.

WebRTC works on all Internet-connected devices:

  • in all major browsers
  • in applications for mobile devices – e.g. iOS, Android
  • on desktop applications for computers – e.g., Windows and Mac
  • on smartwatches
  • on smart TV
  • on virtual reality helmets
WebRTC-supported devices
WebRTC supporting devices

To make WebRTC work on these different devices, the WebRTC library was created.

What kind of security does WebRTC offer?

Data encryption other than audio and video: DTLS

The WebRTC library incorporates the DTLS protocol. DTLS stands for Datagram Transport Layer Security. It encrypts data in transit, including keys for transmitting encrypted audio and video. Here you can find the official DTLS documentation from the IETF – Internet Engineering Task Force.

DTLS does not need to be enabled or configured beforehand because it is built in. The video application developer doesn’t need to do anything – DTLS in WebRTC works by default.

DTLS is an extension of the Transport Layer Security (TLS) protocol, which provides asymmetric encryption. Let’s use the example of a paper letter and a parcel to understand what symmetric and asymmetric encryption are.

We exchange letters. A postal worker can open a normal letter; it can be stolen and read. We want nobody but us to be able to read the letters. You come up with a way to encrypt them, like swapping letters in the text. For me to decipher your letters, you have to describe how your cipher is decoded and send that description to me. This is symmetric encryption: both you and I encrypt the letters, and we both have the decryption algorithm – the key.

The weakness of symmetric encryption is in the transmission of the key. It can also be read by the letter carrier or this very letter with the key can be stolen.

The invention of asymmetric encryption was a major mathematical breakthrough. It uses one key to encrypt and another key to decrypt. It is impossible to know the decryption key without having the encryption key. That’s why an encryption key is called a public key – you can safely give it to anyone, it can only encrypt a message. The decryption key is called a private key – and it’s not shared with anyone.

Instead of encrypting the letter and sending me the key, you send me an open lock and keep the key. I write you a letter, put it in a box, put my open lock in the same box, and latch your lock on the box. I send it to you, and you open the box with your key, which has not passed to anyone else.

In modern symmetric encryption, keys are disposable: for example, when we make a call, the keys are created specifically for that call and deleted as soon as we hang up. So asymmetric and symmetric encryption are equally secure once the connection is established and the keys have been exchanged; the weakness of symmetric encryption is only that the decryption key has to be transferred.

But asymmetric encryption is much slower than symmetric encryption. The mathematical algorithms are more complicated, requiring more steps. That’s why asymmetric encryption is used in DTLS only to securely exchange symmetric keys. The data itself is encrypted with symmetric encryption. 

What data DTLS encrypts in WebRTC: all except video and audio
Text messages encryption with DTLS

How to bypass DTLS?

Cracking the DTLS cipher is a complex mathematical problem. It’s not considered to be done in a reasonable time without a supercomputer – and probably not with one either. It’s more profitable for hackers to look for other WebRTC security vulnerabilities. 

The only way to bypass DTLS is to steal the private key: steal your laptop or pick the password to the server. 

In the case of video calls through a media server, the server is a separate computer that stores its private key. If you access it, you can eavesdrop and spy on the call. 

It is also possible to access your computer. For example, you have gone out to lunch and left your computer on in your office. An intruder enters your office and downloads a file on your computer that will give him your private key. 

But first of all, it’s like stealing gas: to steal gas, you have to be sitting at the gas line. The intruder has to have access to the wires that carry your information – or be on the same Wi-Fi network: sitting in the same office, for instance. But why go through all that trouble: it’s easier to upload a file to your computer that will record your screen and sound and send them to the intruder. You may accidentally download such a malicious file from the Internet yourself if you download unverified programs from unverified sites.

Second, this is not hacking DTLS encryption, but hacking your computer.

How to protect yourself from a DTLS vulnerability?

  • Don’t leave your computer turned on without your password.
  • Keep your computer’s password safe. If you are the owner of a video program, keep the password from the server where it is installed safely. Change your password on a regular basis. Don’t use the password that you use elsewhere.
  • Don’t download untested programs.
  • Don’t download anything from unverified sites.

Audio and video encryption: SRTP

DTLS encrypts everything but the video and audio. DTLS is secure, but because of this it’s slow. Video and audio are “heavy” types of data, so DTLS is not used for real-time video and audio encryption – it would lag. They are encrypted by SRTP – the Secure Real-time Transport Protocol, which is faster but therefore less secure. Here is the official SRTP documentation from the IETF.

What data SRTP encrypts in WebRTC: video and audio
SRTP for audio and video encryption

How to bypass SRTP?

2 SRTP security vulnerabilities:

1) Packet headers are not encrypted

SRTP encrypts the contents of RTP packets, but not the header. Anyone who sees SRTP packets will be able to tell if the user is currently speaking. The speech itself is not revealed, but it can still be used against the speaker. For example, law enforcement officials would be able to figure out if the user was communicating with a criminal.

2) Cipher keys can be intercepted

Suppose users A and B are exchanging video and audio. They want to make sure that no one is eavesdropping. To do this, the video and audio must be encrypted. Then, if they are intercepted, the intruder will not understand anything. User A encrypts his video and audio. Now no one can understand them, not even B. A needs to give B the key so that B can decrypt the video and audio in his place. But the key can also be intercepted – that’s the vulnerability of SRTP.

How to defend against SRTP attacks?

1) Packet headers are not encrypted

There is a proposed standard on how to encrypt packet headers in SRTP. As of October 2021, this solution is not yet included in SRTP; its status is that of a proposed standard. When it’s included in SRTP, its status will change to “approved standard”. You can check the status here, under the Status heading.

2) Cipher keys can be intercepted

There are 2 methods of key exchange:
1) via SDES – Session Description Protocol Security Descriptions
2) via DTLS encryption

1) SDES doesn’t support end-to-end encryption. That is, if there is an intermediary between A and B, such as a proxy, you have to give the key to the proxy. The proxy will receive the video and audio, decrypt them, encrypt them back – and pass them to B. Transmission through SDES is not secure: it is possible to intercept decrypted video and audio from the intermediary at the moment when they are decrypted, but not yet encrypted back.

2) The key is no longer “heavy” video or audio. It can be encrypted with reliable DTLS – it can handle key encryption quickly, no lags. This method is called DTLS-SRTP hybrid. Use this method instead of SDES to protect yourself.

IP Address Protection – IP Location Privacy

The IP address is the address of a computer on the Internet.

What an IP address looks like
IP protection

What is the danger if an intruder finds out your IP address?

Think of IP as your home address. The thief can steal your passport, find out where you live, and come to break into your front door.

Once they know your IP, a hacker can start looking for vulnerabilities in your computer. For example, they can run a port scan and find out what programs you have installed.

For example, it’s a messenger. And there’s information online that this messenger has a vulnerability that can be used to log onto your computer. A hacker can use it as in the case above: when you downloaded an unverified program and it started recording your screen and sound and sending them to the hacker. Only in this case, you didn’t install anything yourself, you were careful. But the hacker downloaded this program to your computer through a messenger vulnerability. Messenger is just an example. Any program with a vulnerability on your computer can be used.

The other danger is that a hacker can use your IP address to determine where you are physically. This is why, in movies, negotiators keep a terrorist talking on the phone – to get a fix on their location.

How do I protect my IP address from intruders?

It’s impossible to be completely protected from this. But there are two ways to reduce the risks:

1) Postpone the IP address exchange until the user picks up the phone. Then, if they do not take the call, the other party will not learn their address. But if they do pick up, they will. This is done by holding back the JavaScript ICE negotiation until the user answers the call.

ICE – Internet Connectivity Establishment: It describes the protocols and routes needed for WebRTC to communicate with the remote device. Read more about ICE in our article WebRTC in plain language.

The downside:
Remember how social networks and Skype show you who's online and who's not? You won't be able to do that.
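
A minimal sketch of this first method, with a hypothetical signaling wrapper: the RTCPeerConnection – and with it ICE and the IP exchange – is only created after the user explicitly accepts the call.

// Hypothetical helpers – replace with your own signaling and UI code.
declare const signaling: {
  on(event: string, handler: (offer: RTCSessionDescriptionInit) => void): void;
  send(event: string, payload: unknown): void;
};
declare function showIncomingCallScreen(): void;
declare function waitForUserToAccept(): Promise<boolean>;

signaling.on("incoming-call", async (offer) => {
  showIncomingCallScreen();                   // only UI so far – no ICE, no IP addresses leaked
  if (!(await waitForUserToAccept())) return; // user declined – the caller never learns this device's IP

  // ICE (and the IP exchange) only starts once the peer connection is created and descriptions are set.
  const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.com" }] });
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  signaling.send("answer", answer);
});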

2) Don’t use p2p communication, but use an intermediary server. In this case, the interlocutor will only know the IP address of the intermediary, not yours.

The disadvantage:
All traffic will go through the intermediary. This creates other security problems like the one above about SDES.

If the intermediary is a media server and it’s installed on your server, it’s as secure as your server because it’s under your control. For measures to protect your server, see the SOP section below.
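
For browser clients, one way to implement the second approach is to allow only relayed ICE candidates, so the other party sees the TURN server's address instead of yours. A minimal sketch – the TURN URL and credentials are placeholders:

// Only relay candidates are gathered, so peers never see each other's real IPs.
const pc = new RTCPeerConnection({
  iceServers: [{
    urls: "turn:turn.example.com:3478", // your intermediary server
    username: "demo-user",              // placeholder credentials
    credential: "demo-password",
  }],
  iceTransportPolicy: "relay",          // forbid host and srflx candidates
});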

What security methods do browsers offer?

These methods apply only to web applications running in a browser; they don't apply to mobile WebRTC applications, for example.

SOP – Same Origin Policy

When you open a website, the scripts needed to run that site are downloaded to your computer. A script is a program that runs inside the browser. Each script is downloaded from somewhere – the server where it is physically stored. This is its origin. One site may have scripts from different origins. SOP means that scripts downloaded from different origins do not have access to each other.

For example, you have a video chat site. It has your scripts – they are stored on your server. And there are third-party scripts – for example, a script to check if the contact form is filled out correctly. Your developer used it so he didn’t have to write it from scratch himself. You have no control over the third-party script. Someone could hack it: gain access to the server where it is stored and make that script, for example, request access to the camera and microphone of users on all sites where it is used. Third-party scripting attacks are called XSS – cross-site scripting.

If there were no SOP, the third-party script would simply gain access to your users’ cameras and microphones. Their conversations could be viewed and listened to or recorded by an intruder.

But the SOP is there. The third-party script isn’t on your server – it’s at another origin. Therefore, it doesn’t have access to the data on your server. It can’t access your user’s camera and microphone. 

But it can show the user a request to grant it access to the camera and the microphone. The user will see the "Grant access to camera and microphone?" prompt again, even though they have already granted access. This will look strange, but the user may grant access, thinking they are giving it to your site. Then the attacker would still be able to watch and listen to their conversations. The SOP's protection lies exactly here: without the SOP, access would be gained silently, with no new request the user could notice.

Access to the camera and microphone is just the most obvious example. The same goes for screen sharing, for example.

It’s even worse with text chat. If there were no SOP, it would be possible to send this malicious script to the chat room. Scripts aren’t displayed in chat: the user would see a blank message. But the script would be executed – and the attacker could watch and listen to his conversations and record them. With SOP the script will not run – because it is not on your server, but in another origin.

How to bypass SOP and how to protect yourself

3 SOP vulnerabilities: errors in CORS, connections via WebSocket, and server hacking
What SOP protects from

1) Errors in CORS – Cross-Origin Resource Sharing

Complex web applications cannot work comfortably in an SOP environment. Even components of the same website can be stored on different servers – in different origins. Asking the user for permission every time would be annoying.

This is why developers are given the ability to add exceptions to the SOP – Cross-Origin Resource Sharing (CORS). The developer must list the exception origins separated by commas, or put "*" to allow all.
During development there are usually several versions of the site: production – available to real users, pre-production – available to the site owner for final testing before release, a test version – for testers, and the developer's own version. Their URLs all differ, so the programmer has to update the list of SOP exceptions every time a build moves from one version to another. There is a temptation to put "*" to speed things up – and then forget to replace the "*" in the production version. If that happens, the SOP will not work for your site at all, and it becomes vulnerable to any third-party script.

How to protect against errors in CORS

For the developer – guard against XSS: list explicit exceptions to the SOP instead of "disabling" it by typing "*".
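
A minimal sketch of such an allow-list on a bare Node.js HTTP server – the origins are just examples:

import * as http from "node:http";

// Explicit list of origins instead of "*"
const allowedOrigins = new Set([
  "https://app.example.com",
  "https://preprod.example.com",
]);

http.createServer((req, res) => {
  const origin = String(req.headers.origin ?? "");
  if (allowedOrigins.has(origin)) {
    // Only known origins get the CORS header; everything else stays blocked by the SOP.
    res.setHeader("Access-Control-Allow-Origin", origin);
  }
  res.end("ok");
}).listen(8080);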

To the user – revoke camera and microphone accesses that are no longer needed. The browser stores a list of permissions: to revoke, you must uncheck the box.

2) Replacing the server your server connects to via WebSocket

What is WebSocket?

Remember the CORS, the SOP exception that you have to set manually? There is another exception that is always in effect by default. This is WebSocket.

Why such an insecure technology, you ask? For real-time communication. The request–response model that the SOP covers doesn't allow real-time communication, because it's one-way: the client always has to ask first.

Imagine you're driving in a car with a child in the back seat. You are the server side, the child is the client side. The child asks you periodically: are we there yet? You answer "no." With request–response, when you finally arrive, you cannot announce "we have arrived" yourself – you have to wait for the child to ask. WebSocket lets you say "we've arrived" on your own, without waiting for the question.

Examples from the field of programming: video and text chats. If WebSocket didn’t exist, the client side would have to periodically ask, “do I have incoming calls?”, “do I have messages?” Even if you ask once every 5 seconds, it’s already a delay. You can ask more often – once a second, for example. But then the load on the server increases, the server must be significantly more powerful, that is, more expensive. This is inefficient and this is why WebSocket was invented.

What is the vulnerability of WebSocket

WebSocket is a direct connection to the server. But which one? Well, normally yours. But what if the intruder replaces your server address with his own? Yes, his server address would not be at your origin. But the connection is through WebSocket, so the SOP won’t check it and won’t protect it.

What can happen because of this substitution? On the client-side, your text or video chat will receive a new message or an incoming call. It will appear to be one person writing or calling, but in fact, it will be an intruder. You may receive a message from your boss, such as “urgently send… my Gmail account password, the monthly earnings report” – whatever. You might get a call from an intruder pretending to be your boss, asking you to do something. If the voices are similar, you won’t even think that it might not be him – because the call is displayed as if it was from him.

How this can be done is a creative question. You have to look for vulnerabilities in the site. An example is XSS. You have a site with a video chat and a contact form, the messages from which are displayed in the admin panel of the site. A hacker sends the “replace the server address with this one” script to the contact form. The script appears in the admin panel along with all the messages from the contact form. Now it’s “inside” your site – it has the same source. SOP will not stop it. The script is executed, the server address is changed to this one.

How to protect against spoofing the server that your server connects to via WebSocket

  • Filter any data from users to scripts

    If the developer programmed not to accept scripts from users – the message from the contact form in the example above would not be accepted, and an intruder would not be able to spoof your server into his own on a WebSocket connection this way. You should always filter user messages for scripts, this will protect against server spoofing in WebSocket as well as many other problems.
  • Program a check that the connection through WebSocket is made to the correct origin

    For example, generate a unique codeword for each WebSocket connection. The codeword itself is obtained over a regular request, which the SOP does cover: if a page from a third-party origin asks for a codeword, the SOP will not let the request through – because the third-party server is at a different origin. See the sketch after this list.
  • Code obfuscation

    To obfuscate code is to make it incomprehensible while keeping it working. Programmers write code clearly – at least they should 🙂 – so that if another developer takes over the code, they can work out which part does what. For example, programmers name variables descriptively. The server address to connect to via WebSocket is also a variable and will be named clearly, e.g. "server address for WebSocket connection". After running the code through obfuscation, this variable will be called, for example, "C". An outside intruder will not understand which variable is responsible for what.

    The mechanism of codeword generation is stored in the code. Cracking it is an extra effort, but it is possible. If you make the code unreadable, the intruder won’t be able to find this mechanism in the code.
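
A minimal sketch of the origin and codeword check, assuming the Node.js ws package on the server and a one-time token that the page obtained earlier over a regular HTTPS request (which the SOP does protect):

import { WebSocketServer } from "ws";

const allowedOrigin = "https://app.example.com"; // your site
const issuedTokens = new Set<string>();          // filled when the page requests a token over HTTPS

const wss = new WebSocketServer({ port: 8081 });

wss.on("connection", (socket, request) => {
  const origin = request.headers.origin;
  const token = new URL(request.url ?? "/", "https://placeholder").searchParams.get("token");

  // Reject connections that don't come from our origin or don't carry a valid one-time token.
  if (origin !== allowedOrigin || !token || !issuedTokens.delete(token)) {
    socket.close();
    return;
  }
  // ...the connection is trusted from here on
});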

3) Server hacking

If your server gets hacked, a malicious third-party script can be “put” on your server. The SOP will not help: Your server is now the source of this script. This script will be able to take advantage of the camera and microphone access that the user has already given to your site. The script still won’t be able to send the recording to a third-party server, but it doesn’t need to. The attacker has access to your server: he can simply take the recording from there.

How a server can be hacked is not among WebRTC security issues, so it’s beyond the scope of this article. For example, an attacker could simply steal your server username and password.

How to protect yourself from the server hack

The most obvious thing is to protect the username-password. 

If your server is hacked, you can’t protect yourself from the consequences. But there are ways to make life difficult for the attacker.

1) Store all user content on the server in encrypted form – for example, recordings of video conferences. The server itself must be able to decrypt them, so the server stores the decryption key. If the server is hacked, the attacker can find it, but that takes time: he won't be able to just swing by the server, copy the conversations, and leave. The time he has to spend on the compromised server increases, which gives the server owner time to take measures – for example, find the intruder's active session, terminate it as an administrator, and change the server password.

2) Ideally, do not store user content on the server. For example, allow recording conferences, but don’t save them on the server, let the user download the file. Once the file is downloaded – only the user has it, it’s not on the server.

3) Give the user more options to protect himself – develop notifications in the interface of your program. We don’t recommend this method for everyone, because it’s inconvenient for the user. But if you are developing video calls for a bank or a medical institution, security is more important than convenience:

  1. Ask for access to the camera and microphone before each call.

    If your site gets hacked and someone wants to place a call on behalf of the user without their permission, the user will get a notification: "Allow camera and microphone access for the call?" They didn't initiate that call, so this is likely to keep them safe: they'll click "no." It's safe, but it's inconvenient – what percentage of users will go to a competitor instead of clicking "allow" before every call? (See the sketch after this list.)
  2. Ask for access to the camera and microphone to call specific users.

    Calling a user for the first time? See a notification saying “Allow camera and microphone access for calls to …Chris Baker (for example)?”. It’s less inconvenient for the user if they call the same people often. But it still loses in convenience to programs that ask for access only once.
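
A minimal sketch of the first option for a web client, assuming the app shows its own confirmation before touching the camera and microphone for each call (whether the browser shows its own prompt as well depends on how the permission was saved):

async function startCall(callerName: string): Promise<MediaStream | null> {
  // The app's own per-call confirmation, shown even if the browser permission is already persisted.
  const ok = window.confirm(`Allow camera and microphone for the call with ${callerName}?`);
  if (!ok) return null;

  // Only now do we actually request the devices.
  return navigator.mediaDevices.getUserMedia({ video: true, audio: true });
}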

Use a known browser from a trusted source

What is it?

The program you use to visit websites. Video conferencing works in the browser. When you use it, you assume the browser is secure.

How do attackers use the browser?

By injecting malicious code that does what the hacker wants.

How to protect yourself?

  • Don’t download browsers from untrusted sources.
    Here’s a list of official sites for the most popular browsers:
  • Don’t use unknown browsers
    Just like with the links. If a browser looks suspicious, don’t download it.
    You can give a list of safe browsers to the users of your web application. Although, if they are on your site, it means that they already use some browser… 🙂

What security measures should the developer think about?

WebRTC was built with security in mind. But not everything depends on WebRTC because it’s only a part of your program that is responsible for the calls. If someone steals the user’s password, WebRTC won’t protect it, no matter how secure the technology is. Let’s break down how to make your application more secure.

Signaling Layer

The signaling layer is responsible for exchanging the data needed to establish a connection. How connection establishment works is up to the developer to write – it happens before WebRTC and all its encryption come into play. Simply put: you're sitting on a video call site and a pop-up appears, "Call for you, accept/reject?" Everything that happens before you hit "accept" is the signaling layer establishing the connection.

How can attackers use the signaling layer, and how can you protect yourself?

There are many possibilities to do this. Let’s look at the 3 primary ones: Man-in-the-Middle attack, Replay attack, Session hijacking.

Attack on signalling layer: 2 people in process of establishing a connection, the intruder connects in the middle
  • MitM (Man-in-the-Middle) attack

In the context of WebRTC, this is the interception of traffic before the connection is established – before the DTLS and SRTP encryption described above comes into effect. An attacker sits between the callers. He can eavesdrop and spy on conversations or, for example, send a pornographic picture to your conference – this is called zoombombing.

This can be any intruder connected to the same Wi-fi or wired network as you – he can watch and listen to all the traffic going on your Wi-fi network or on your wire.

How to protect yourself?

Use HTTPS instead of HTTP. HTTPS supports SSL/TLS encryption throughout the session. Man-in-the-middle will still be able to intercept your traffic. But the traffic will be encrypted and he won’t understand it. He can save it and try to decrypt it, but he won’t understand it right away.

SSL – Secure Sockets Layer – is the predecessor to TLS. It turns HTTP into HTTPS, securing the site. Users used to visit HTTP and HTTPS sites without seeing the difference. Now HTTPS is a mandatory standard: developers have to protect their sites with SSL certificates. Otherwise, browsers won't let the user straight onto the site: they show that dreaded "your connection is not secure" message, and only by clicking "more" can the user choose "proceed to the site anyway". Not all users will click through, which is why developers now add SSL certificates to their sites.

  • Replay attack

You have protected yourself from man-in-the-middle with HTTPS. Now the attacker hears your messages but does not understand them. But he still hears them! And therefore he can repeat them – replay. For example, you gave the command "transfer 100 dollars". The attacker, though he does not understand it, repeats "transfer 100 dollars" – and without additional protection, the command will be executed. You will be charged 100 dollars twice, and the second 100 dollars will go to the same place as the first.

How to protect yourself?

Set a random session key. This key is active during one session and cannot be used twice. "Transfer $100. ABC". If an intruder repeats "Transfer $100. ABC", it becomes clear that the message is a repeat and should not be executed. This is exactly what we did in the NextHuddle project – a video conferencing service for educational events. NextHuddle is designed for an audience of 5000 users and 25 streamers.
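
A minimal sketch of the idea (the command format and names are hypothetical): every command carries a one-time token, and the receiving side refuses to execute the same token twice.

import { randomUUID } from "node:crypto";

const usedTokens = new Set<string>();

// Sender: attach a fresh one-time token to every command.
function makeCommand(action: string): { action: string; token: string } {
  return { action, token: randomUUID() };
}

// Receiver: a replayed command reuses a token that has already been seen – reject it.
function executeCommand(cmd: { action: string; token: string }): boolean {
  if (usedTokens.has(cmd.token)) return false; // replay detected, do not execute
  usedTokens.add(cmd.token);
  // ...actually perform cmd.action, e.g. "transfer $100"
  return true;
}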

  • Session hijacking

Session hijacking is when a hacker takes over your Internet session. For example, you call the bank. You say who you are, your date of birth, or a secret word. “Okay, we recognize you. What do you want?” – and then the intruder takes the phone receiver from you and tells them what he wants.

How do you protect yourself?

Use HTTPS. You have to be man-in-the-middle to hijack the session. So what protects against man-in-the-middle also protects against session hijacking.

Selecting the DTLS Encryption Bit

DTLS is an encryption protocol. It uses encryption algorithms such as AES. AES comes in different key lengths – 128-bit, or the more complex and better-protected 256-bit. In WebRTC, the choice is up to the developer. Make sure the AES key length selected is the one that gives the highest security: 256.

AES-256 encryption compared to AES-128 as a bigger lock against a smaller one
Encryption bits security

You can read how to do this in the Mozilla documentation, for example: a certificate is generated, and you pass it in when you create the peer connection.
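
In the browser, the part you control directly is the certificate used for the DTLS handshake, roughly as in the sketch below. Which SRTP cipher (AES-128 or AES-256) is finally used is still negotiated by the WebRTC stacks on both ends, so check your media server's settings as well.

async function createPeerConnectionWithOwnCertificate(): Promise<RTCPeerConnection> {
  // Generate the certificate used for the DTLS handshake...
  const keygenAlgorithm: EcKeyGenParams = { name: "ECDSA", namedCurve: "P-256" };
  const certificate = await RTCPeerConnection.generateCertificate(keygenAlgorithm);

  // ...and pass it in when creating the peer connection.
  return new RTCPeerConnection({ certificates: [certificate] });
}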

Authentication and member tracking

The task of the developer is to make sure that everyone who enters the video conference room is authorized to do so.

Example 1 – private rooms: for example, a paid video lesson with a teacher. The developer should program a check: has the user paid for the lesson? If he has paid, let him in, and if he hasn’t, don’t let him in.

This seems obvious, but we have encountered many cases where you could copy the URL of such a paid conference, send it to anyone, and they could join even though they hadn't paid for the lesson.

Example 2 – open rooms: for example, business video conferences of the “join without registration” type. This is done for convenience: when you don’t want to make a business partner waste time and register. You just send him a link, he follows it and gets in the conference.

If there are only a few participants, the owner will notice when an extra person joins. But with many participants, he won't. One way out is for the developer to program manual approval of new participants by the conference owner.

Example 3 – helping the user to protect his login and password. If an intruder gets hold of a user’s login and password, he will be able to log in with it.

Offer login through third-party services – for example, social networks, Google login, or Apple login on mobile devices. You can also drop the password altogether and send a login code to the user's email or phone. This reduces the number of passwords a user has to keep. A thief would then need to steal a password not from your program, but from a third-party service such as a social network, mobile account, or email.

You can use two methods at once – for example, the username and password from your program plus a confirmation code on the phone. Then, in order to hack an account, an attacker would need to steal two credentials instead of one.

Phone in a hand with a verification code input screen to protect login
Verification UI

Not every user will want such a long login procedure just to make a call. You can offer a choice: one login method or two. Those who care about security will choose two – and be grateful.

Access Settings

Let’s be honest – we don’t always read the access settings dialogs. If the user is used to clicking OK, the application may get permissions he didn’t want to give. 

At the other extreme, the user may delete the app if they don't immediately understand why they're being asked for access.

Good and bad example of permission request in apps
Location protection UI

The solution is simple – show care. Write clearly what permissions the user gives and why.

For example, in mobile applications: before showing a standard pop-up requesting access to geolocation, show an explanation like "People in our chat call others nearby. Allow geolocation access so we can show you the people near you."

Screen sharing

Any app with a screen sharing feature should warn the user about exactly what they are showing.

For example, before a screen sharing session, when the user selects the area of the screen to be shown, display a reminder so that the user doesn't accidentally share a piece of the screen with data they didn't mean to show: "What do you want to share?" – with options such as "the entire screen" or "only one application" (and which one – for example, just the browser).

Choice what app to share when screensharing
Screensharing protection
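
In a web client this pairs naturally with the browser's own picker. A minimal sketch: the app shows its reminder first, and only then calls getDisplayMedia, which lets the user pick the entire screen, a single window, or a browser tab.

async function startScreenShare(): Promise<MediaStream | null> {
  // The app's own reminder before anything is captured.
  const ok = window.confirm(
    "You are about to share your screen. Close anything you don't want others to see. Continue?"
  );
  if (!ok) return null;

  // The browser then shows its own picker: entire screen, a window, or a browser tab.
  return navigator.mediaDevices.getDisplayMedia({ video: true });
}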

If you gave a site permission to do screen sharing and that site gets hacked, the hacker can send you a script that opens some web page in your browser while you're sharing the screen. For example, he knows how links to social network posts are formed and has built a link to your correspondence with a particular person that he wants to see. He's not logged in to your social network, so when he follows that link himself, he won't see your correspondence. But if he has hacked a site that you've allowed to share your screen, then the next time you share the screen there, he can run a script that opens that correspondence page in your browser. You will rush to close it, but too late: the screen share has already passed it to the intruder. The protection against this is the same as against a server hack – keep your passwords safe. For the attacker, though, hacking the site is hard; it's easier to send a fake link that requests screen sharing.

Where to read more about WebRTC security

There are many articles on the internet about security in WebRTC. There are 2 problems with them:

1) They merely express someone’s subjective opinion. Our article is no exception. The opinion may be wrong.

2) Most articles are technical: it might be difficult for somebody who’s not a programmer to understand.

How to solve these problems?

1) Use the scientific method of research: read primary sources, the publications confirmed by someone’s authority. In scientific work, these are publications in Higher Attestation Commission (HAC) journals – before publication in them, the work must be approved by another scientist from the HAC. In IT these are the W3C – World Wide Web Consortium and the IETF – Internet Engineering Task Force. The work is approved by technical experts from Google, Mozilla, and similar corporations before it is published.
WebRTC security considerations from the W3C specification – in brief
WebRTC security considerations from the IETF – details on threats, a bit about protecting against them
IETF’s WebRTC security architecture – more on WebRTC threat protection

2) The documentation above is correct but written in such technical language that a non-technical person can’t figure it out. Most of the articles on the internet are the same way. That’s why we wrote this one. After reading it:
– The basics will become clear to you (hopefully). Maybe this will be enough to make a decision.
– If not, the primary sources will be easier for you to understand. Cooperate with your programmer – or reach out to us for advice.

Conclusion

Security of a WebRTC: WebRTC alone = 1 shield of 3, WebRTC + Good developer = 3 shields of 3
WebRTC is secure but it’s even more secure with a good developer

WebRTC itself is secure. But if the developer of a WebRTC-based application doesn’t take care of security, his users will not be safe.

For example, in WebRTC all data except video and audio is encrypted by DTLS, and audio and video are encrypted by SRTP. But many WebRTC security settings are chosen by the developer of the video application: for example, how the SRTP keys are transferred – over top-level DTLS security or not.

Furthermore, WebRTC is only a way to transmit data when the connection is already established. What happens to users before the connection is established is entirely up to the developer: as he programs it, so it will be. What SOP exceptions to set, how to let users in a conference, whether to use HTTPS – all this is up to the developer.

Write to us, we’ll check your video application for security. 

Check out our Instagram – we post projects there, most of which were made on WebRTC.

Categories
Uncategorized

Kurento Media Server: Everything You Need To Know In 2022

kurento-media-server

A short review on Kurento in simple language for business people and those who’re not into all that techy stuff.

Imagine that a programmer approached you and said that he needs a media server for development, and he recommends that you use Kurento. How do you know whether it’s the best choice? There’s a lot of information but digesting it might be difficult as it’s all deeply technical.

We'll try our best to provide you with enough information to decide whether or not you should use Kurento in your video app. We'll explain why the media server is important, and cover the license, the architecture, and the main functions you can develop with Kurento – its modules. We'll finish with a summary of when it's best to use a low-level media server such as Kurento, and when an out-of-the-box solution is the better fit.

WebRTC and why we need a media server

Kurento is an open-source WebRTC media streaming server with many built-in video conferencing modules released under the Apache license. WebRTC is a standardized, low latency, real-time, browser-to-browser transmission method without the need for third-party plugins or extensions. WebRTC is a fully client-side technology, so why would we need a media server?

The main reason is the load on the client with a large number of participants. The number of connections between participants grows quadratically, while video quality worsens and the load on traffic and system resources grows. WebRTC can be used for normal P2P communication between 2–6 participants (in our experience), but with more participants it makes more sense to use a media server. In addition, there are difficulties if we need to save a video recording to a separate file or somehow process it on the fly, because all the work lies on the client.

WebRTC is also quite a secure option. Why? We explain here.

A short history of Kurento

Kurento was developed in 2010 in Madrid as a separate open-source project. The main language that Kurento uses is C++, which helps to optimize the resources of the system.

The media server has more than 2 900 stars on GitHub and a few hundred forks, which are separate branches of the project supported by the community.

The Kurento development team has since joined Twilio; Kurento itself still receives minor versions with small patches, and new releases keep coming out.

Apache, the license of Kurento

Kurento is released under the Apache license. It gives developers a great deal of freedom when working with the code: you just have to state what changes you made and credit the original author.

Software under the Apache license can be used in commercial products. You can use Kurento in your products for free and earn income from those products. There is no need to pay any royalties.

You can find the full text about Apache here.

Kurento architecture: MCU and SFU

There are two main types of media server architectures: MCU and SFU. We will give an overview in this article. If you’re interested in more details, check out this article.

MCU (Multipoint Conferencing Unit) is a collage-like video architecture. We have multiple user streams which make up one big seamless picture, with each item's location unchanged. The MCU takes an outgoing video stream from each participant, then the media server stitches all the streams into one with a fixed layout. Thus, even though there are many participants in the conference, each client receives only one stream as input. This saves CPU resources and traffic on the client side, but increases the load on the server itself and limits the possibilities for video chat layout customization. With MCU we can't tell exactly which participant is on the video. Mixing streams requires a lot of processing power, increasing the cost of maintaining the server. This type of architecture is mostly suitable for meetings with a large number of participants (> 40). MCU is also a good solution if you need to receive video streams on weak devices, e.g. on a phone, thanks to the processing being done on the server.

SFU (Selective Forwarding Unit) is a popular architecture in modern WebRTC solutions that allows the video conference client to receive only the video streams it needs at the moment. SFU is more like a mosaic where you assemble the elements yourself, and the order of assembly is up to you. In SFU each participant also sends his stream to the server, but receives the other participants' streams separately. This architecture distributes the load between server and client better and gives full control over the implementation of the video chat interface. Unlike MCU, a server with SFU doesn't have to decode and transcode incoming streams, which significantly reduces the load on the server CPU. SFU is well suited for broadcasting (one-to-many, one-person streaming) thanks to its ability to dynamically scale the system depending on the number of streams. At the same time, this type of server requires more outgoing server bandwidth because it has to send more streams to clients.

Let’s compare these 2 types using the 4-people video conference as an example.

On the client:

compare-MCU-to-SFU-on-the-client
Differences between SFU and MCU for the client

On the server:

compare-MCU-to-SFU-on-the-server
Differences between SFU and MCU for the server

There are also systems that use hybrid architectures to achieve the best result depending on the current number of users and the needs of a particular client. For example, if the client is a weak mobile device, it can receive a single stream from the media server as in the MCU. Browser users, on the other hand, will receive streams separately, where it is possible to implement a unique way of meshing elements as in SFU. Another case would be interaction with SIP devices (mostly IP telephony) where only the MCU is supported. Some hybrids may change from SFU to MCU on the fly when the number of participants reaches a certain threshold. There is the XDN (Experience Delivery Network) architecture from red5pro, which uses cloud technologies to solve WebRTC scaling problems. It has clusters that consist of different types of nodes. There are the source, relay, and edge nodes. Within this topology, any source node receives incoming streams and exchanges data with several edge nodes to support thousands of participants. For larger cases, source nodes can pass the flow to relay nodes, which pass the flow to multiple edge nodes to scale the cluster even more.

Kurento allows both types and uses SFU by default. MCU architecture can be achieved using the Compositor element, and you can mix SFU and MCU to obtain a hybrid type.

Description of the main modules and functionality

What are modules: main principles of Kurento design

Kurento includes several basic modules tailored to work not only with WebRTC, but also with video recording, computer vision, and AR filters. The media server would be a good choice if you want to work directly with regular WebRTC without using additional wrappers. It can be useful if you want to integrate your video chat with a native Android and iOS app. Kurento does not include a signaling mechanism, you can choose what works best for you depending on your project requirements, be it WebSockets or something else. Since Kurento is an open-source project, this significantly reduces the cost of using a media server.

Unlike the off-the-shelf paid media servers, Kurento is a rather low-level solution that allows you to customize the interaction of modules in the pipeline. Also, unlike, for example, Jitsi, which is a boxed solution and has a ready-made interface, with Kurento it’s much easier to choose and implement the interface you want. And of course, you can write your own module extending the standard media server functionality. The basis of the Kurento architecture design is modularity, where each media element is a program block that performs a certain task and can interact with other elements. In Kurento jargon, the developers call this Media Pipeline and Media Elements. Media Elements are simply modules that connect to each other within the Pipeline. Note, however, that the topology of the Pipeline is chosen by the developers, it does not necessarily have to be a linear sequence of elements.

Let’s talk about the main modules now.

WebRTC video conference

The basis of Kurento for video conferencing is the WebRtcEndpoint. Using it, we can transfer WebRTC streams and interconnect them in a single pipeline. In a video conference, the pipelines can usually be thought of as one room that is isolated from other similar rooms. In this case, WebRtcEndpoint is the user who wants to broadcast their own video or watch someone else’s.
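
A minimal sketch with the Node.js kurento-client library, in the callback style used by the official tutorials (the WebSocket URL is the default local one, and the SDP handling is only hinted at):

const kurento = require("kurento-client");

declare const sdpOffer: string; // SDP offer received from the browser via your signaling

// One MediaPipeline per "room", one WebRtcEndpoint per participant.
kurento("ws://localhost:8888/kurento", (err: Error, client: any) => {
  if (err) throw err;
  client.create("MediaPipeline", (err: Error, pipeline: any) => {
    if (err) throw err;
    pipeline.create("WebRtcEndpoint", (err: Error, presenter: any) => {
      if (err) throw err;
      pipeline.create("WebRtcEndpoint", (err: Error, viewer: any) => {
        if (err) throw err;
        presenter.connect(viewer); // the presenter's media flows into the viewer's endpoint
        presenter.processOffer(sdpOffer, (err: Error, sdpAnswer: string) => {
          // send sdpAnswer back to the browser via your signaling
        });
        presenter.gatherCandidates(); // ICE candidates are then emitted as events
      });
    });
  });
});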

Recording video and audio

The media server uses RecorderEndpoint to record. We can connect the WebRTC stream to the recorder, and the recorder will write the result to the file system. After connecting the stream and starting the video, just call the recorder's "record" method in the Kurento API. We can specify restrictions when recording: for example, we can record only audio or only video.
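
A minimal sketch of hooking a recorder up to an existing endpoint – the file URI is an example, and pipeline / webRtcEndpoint stand for the objects created as in the previous sketch:

// A media pipeline and a participant's endpoint created as in the previous sketch:
declare const pipeline: any;
declare const webRtcEndpoint: any;

pipeline.create(
  "RecorderEndpoint",
  { uri: "file:///tmp/call-recording.webm" }, // where the media server writes the file
  (err: Error, recorder: any) => {
    if (err) throw err;
    // Feed the participant's stream into the recorder and start writing.
    webRtcEndpoint.connect(recorder, (err: Error) => {
      if (err) throw err;
      recorder.record();
    });
  }
);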

Playing third-party media

Kurento has a built-in PlayerEndpoint module that allows you to play third-party audio and video files as well as RTMP and RTSP streams. This means that you can add a stream from the IP camera and broadcast it to other participants of the video conference. You can use the player to play music for users, which is good for music platforms, online training, or even text-to-speech. Thanks to the flexibility of the pipeline, it’s possible to play media not only to all participants but also to a specific user depending on your needs. You can do this simply by creating a new PlayerEndpoint and connecting it to the desired WebRtcEndpoint using the connect method. Once you need to play a media file, just call the player’s play method.
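
And a matching sketch for playing a third-party source to one specific participant (the RTSP URL is a placeholder):

declare const pipeline: any;       // MediaPipeline from the first sketch
declare const webRtcEndpoint: any; // the participant who should receive the media

pipeline.create(
  "PlayerEndpoint",
  { uri: "rtsp://camera.example.com/stream" }, // a file, RTMP or RTSP source
  (err: Error, player: any) => {
    if (err) throw err;
    // Route the played media to one specific participant's endpoint.
    player.connect(webRtcEndpoint, (err: Error) => {
      if (err) throw err;
      player.play();
    });
  }
);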

Computer vision and filters

The media server allows you to use many filters. For example, ZBarFilter is used to recognize QR codes, FaceOverlayFilter can recognize a face in a video stream and highlight it in real-time. And with GStreamerFilter, you can create your own custom filter. Also, Kurento has several experimental modules, including filters which are installed separately. There is kms-crowddetector that can detect a crowd, and there is a kms-platedetector that can detect a vehicle’s license plate number.

No support for multi-streaming with different resolutions

There is no support for Simulcast (sending multiple copies of the same stream in different quality) and SVC (sending a stream in low quality with the ability to add additional layers to improve quality if needed).

There’s a solution. You can create several streams and request videos with different resolutions. Switch between video streams depending on packet losses. However, this solution is far from ideal as it puts a huge load on the server. So let’s be honest: there’s no solution that’d be good per se.

Codec support in Kurento

In order to stream over a network, computers use codecs. Codecs are programs that can perform data or signal encoding. They allow you to compress video or audio to an acceptable size for network streaming and playback. 

Kurento Media Server supports codecs like H.264, VP8/9 for video, and OPUS for audio. They allow you to achieve a high degree of compression of the video stream while maintaining high quality. These codecs are the current WebRTC standards. There is no support for AV1 and H.265. These are the latest standards that allow you to reduce the bitrate for the same quality as their older counterparts. The media server itself compresses the received video stream to a quality that matches the bandwidth to the client at that moment, and if necessary, transcodes the stream into the codec the client needs.

API, documentation, Typescript support

The Kurento API can be accessed via the kurento-client library and various additional libraries, e.g. kurento-utils. You can find detailed documentation on the official Kurento website, with a description of every module and usage examples for Java and JavaScript – you can go either here or here. The site provides information not only about the media server itself but also about related WebRTC topics, such as configuring a TURN (Traversal Using Relays around NAT) server, which is necessary to bypass the communication limitations of users behind a symmetric NAT. The TURN protocol allows you to get the IP address and port needed to establish a WebRTC connection despite NAT restrictions. Symmetric NATs have additional protection against transport data tampering: the symmetric NAT table stores two extra parameters – the IP and port of the remote host – and packets from the external network are discarded when the source data doesn't match the data recorded in the table. The client library supports TypeScript types.

OpenVidu

Based on Kurento, there is the OpenVidu framework, which is designed for simpler development if you don’t need anything other than simple video conferencing. No need to use a server to communicate with the framework, just connect the client library. The library has many wrappers for all popular frontend frameworks as well as for Android and iOS. OpenVidu is a free solution, but there is a paid version that provides additional features for monitoring and scaling the WebRTC platform. And there are also future plans for Simulcast and SVC and automatic switching to P2P sessions for 1-1 cases.

The number of participants in a video conference

Kurento developers have their own benchmark to determine the maximum number of sessions on one machine. Although it’s originally intended for OpenVidu, it’s also suitable for Kurento. Below you can see a table for the different machines on AWS.

video-conferencing-AWS-benchmarks
Video conferencing benchmarks for AWS machines

For example, a single AWS c5.large instance with two CPU cores and 4 GB of RAM can handle 4 group-type meetings with 7 participants. In each such meeting, users show and watch each other’s streams (many-to-many). 

Summary: when is Kurento suitable?

  • for a video conferencing platform
  • for integration with AR,
  • for video recording
  • for playing media files for conference participants
  • for working with pure RTP, for example, for live streaming.

Kurento is a low-level media server. If you use Kurento, development might take a bit more time than out-of-the-box solutions. That’s if your developer doesn’t specialize in Kurento.

However, Kurento allows you to implement any functionality you want. It often happens with ready-made solutions that the customer asks to create something else, but that functionality isn’t supported.

If you’re interested in Kurento media server installation for your project, request a free quote, and we’ll get back to you ASAP!

Categories
Uncategorized

How to Make a Custom Call Notification on Android? With Code Examples

How to create a custom Android call notification

Channel creation (api 26+)
Displaying a notification
Button actions upon clicking
Notifications with their own design
Full-screen notifications
Conclusion

You will learn how to make incoming call notifications on Android from basic to advanced layouts from this article. Customize the notification screen with our examples.

Last time, we told you what any Android app with calls should have and promised to show you how to implement it. Today we’ll deal with notifications for incoming calls: we’ll start with the simplest and most minimalistic ones, and end with full-screen notifications with an off-system design. Let’s get started! 

Channel creation (api 26+)

Since Android 8.0, each notification must have a notification channel to which it belongs. Before this version of the system, the user could either allow or disallow the app to show notifications, without being able to turn off only a certain category, which was not very convenient. With channels, on the other hand, the user can turn off annoying notifications from the app, such as ads and unnecessary reminders, while leaving only the ones he needs (new messages, calls, and so on).

If we don't specify a channel ID when building a notification (e.g. by using the deprecated builder without one), or if we don't create a channel with that ID, the notification will not be displayed on Android 8 or later.

We need the androidx.core library which you probably already have hooked up. We write in Kotlin, so we use the version of the library for that language:

dependencies {
    implementation("androidx.core:core-ktx:1.5.0")
}

All work with notifications is done through the system service NotificationManager. For backward compatibility, it is always better to use the Compat version of Android classes if you have them, so we will use NotificationManagerCompat. To get the instance:

val notificationManager = NotificationManagerCompat.from(context)

Let’s create our channel. You can set a lot of parameters for the channel, such as a general sound for notifications and a vibration pattern. We will set only the basic ones, and the full list you can find here.

val INCOMING_CALL_CHANNEL_ID = "incoming_call"

// Creating an object with channel data
val channel = NotificationChannelCompat.Builder(
    // channel ID, it must be unique within the package
    INCOMING_CALL_CHANNEL_ID,
    // The importance of the notification affects whether the notification makes a sound, is shown immediately, and so on. We set it to maximum, it's a call after all.
    NotificationManagerCompat.IMPORTANCE_HIGH
)
    // the name of the channel, which will be displayed in the system notification settings of the application
    .setName("Incoming calls")
    // channel description, will be displayed in the same place
    .setDescription("Incoming audio and video call alerts")
    .build()

// Creating the channel. If such a channel already exists, nothing happens, so this method can be used before sending each notification to the channel.
notificationManager.createNotificationChannel(channel)
create-channel-notifcation-on-android
How to create notification channel on Android

Displaying a notification

Wonderful, now we can start creating the notification itself, let’s start with the simplest example:

val notificationBuilder = NotificationCompat.Builder(
    this,
    // channel ID again
    INCOMING_CALL_CHANNEL_ID
)
    // A small icon that will be displayed in the status bar
    .setSmallIcon(R.drawable.icon)
    // Notification title
    .setContentTitle("Incoming call")
    // Notification text, usually the caller's name
    .setContentText("James Smith")
    // Large image, usually a photo / avatar of the caller
    .setLargeIcon(BitmapFactory.decodeResource(resources, R.drawable.logo))
    // For notification of an incoming call, it's wise to make it so that it can't be "swiped"
    .setOngoing(true)

So far we've only created a sort of "description" of the notification, but it's not yet shown to the user. To display it, let's turn to the manager again:

// Let's get to building our notification
val notification = notificationBuilder.build()

// We ask the system to display it
notificationManager.notify(INCOMING_CALL_NOTIFICATION_ID, notification)
set-up-display-android-notification
How to display a notification for Android

The INCOMING_CALL_NOTIFICATION_ID is a notification identifier that can be used to find and interact with an already displayed notification.

For example, the user wasn't answering the call for a long time, the caller got tired of waiting and canceled the call. Then we can cancel the notification:

notificationManager.cancel(INCOMING_CALL_NOTIFICATION_ID)

Or, in the case of a conferencing application, if more than one person has joined the caller, we can update our notification. To do this, just create a new notification and pass the same notification ID in the notify call — then the old notification will simply be updated with the new data, without animating the appearance of a new notification. We can reuse the old notificationBuilder by simply replacing the changed part in it:

notificationBuilder.setContentText("James Smith, George Watson")

notificationManager.notify(
    INCOMING_CALL_NOTIFICATION_ID,
    notificationBuilder.build()
)

Button actions upon clicking

A simple notification of an incoming call, after which the user has to find our application himself and accept or reject the call, is not very useful. Fortunately, we can add action buttons to our notification!

To do this, we add one or more actions when creating the notification. Creating them will look something like this:

val action = NotificationCompat.Action.Builder(
    // The icon that will be displayed on the button (or not, depends on the Android version)
    IconCompat.createWithResource(applicationContext, R.drawable.icon_accept_call),
    // The text on the button
    getString(R.string.accept_call),
    // The action itself, PendingIntent
    acceptCallIntent
).build()

Wait a minute, what does another PendingIntent mean? It’s a very broad topic, worthy of its own article, but simplistically, it’s a description of how to run an element of our application (such as an activity or service). In its simplest form it goes like this:

const val ACTION_ACCEPT_CALL = "action_accept_call" // Intent actions are strings

// We create a normal intent, just like when we start a new Activity
val intent = Intent(applicationContext, MainActivity::class.java).apply {
    action = ACTION_ACCEPT_CALL
}

// But we don't run it ourselves, we pass it to PendingIntent, which will be called later when the button is pressed
val acceptCallIntent = PendingIntent.getActivity(applicationContext, REQUEST_CODE_ACCEPT_CALL, intent, PendingIntent.FLAG_UPDATE_CURRENT)

Accordingly, we need to handle this action in the activity itself.

To do this, in `onCreate()` (and in `onNewIntent()` if you use the flag `FLAG_ACTIVITY_SINGLE_TOP` for your activity), take `action` from `intent` and take the action:

override fun onNewIntent(intent: Intent?) {
    super.onNewIntent(intent)
    if (intent?.action == ACTION_ACCEPT_CALL)
        imaginaryCallManager.acceptCall()
}

Now that we have everything ready for our action, we can add it to our notification via `Builder`

notificationBuilder.addAction(action)
add-buttons-to-android-notification
How to add notification buttons on Android

In addition to the buttons, we can assign an action by clicking on the notification itself, outside of the buttons. Going to the incoming call screen seems like the best solution — to do this, we repeat all the steps of creating an action, but use a different action id instead of `ACTION_ACCEPT_CALL`, and in `MainActivity.onCreate()` handle that `action` with navigation

override fun onNewIntent(intent: Intent?) {
    …
    if (intent?.action == ACTION_SHOW_INCOMING_CALL_SCREEN)
        imaginaryNavigator.navigate(IncomingCallScreen())
}

You can also use `service` instead of `activity` to handle events.

Notifications with their own design

Notifications themselves are part of the system interface, so they will be displayed in the system's own style. However, if you want to stand out, or if the standard arrangement of buttons and other notification elements doesn't suit you, you can give the notifications your own unique style.

DISCLAIMER: Due to the huge variety of Android devices with different screen sizes and aspect ratios, combined with the limited positioning of elements in notifications (relative to regular application screens), Custom Content Notification is much more difficult to support

The notification will still be rendered by the system, that is, outside of our application process, so we need to use RemoteViews instead of the regular View. Note that this mechanism does not support all the familiar elements, in particular, the `ConstraintLayout` is not available.

A simple example is a custom notification with one button for accepting a call:

<!-- notification_custom.xml -->
<RelativeLayout
    …
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <Button
        android:id="@+id/button_accept_call"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_centerHorizontal="true"
        android:layout_alignParentBottom="true"
        android:backgroundTint="@color/green_accept"
        android:text="@string/accept_call"
        android:textColor="@color/fora_white" />

</RelativeLayout>

The layout is ready; now we need to create a RemoteViews instance and pass it to the notification builder.

val remoteView = RemoteViews(packageName, R.layout.notification_custom)

// Set the PendingIntent that will "shoot" when the button is clicked. A normal onClickListener won't work here – again, the notification will live outside our process
remoteView.setOnClickPendingIntent(R.id.button_accept_call, pendingIntent)

// Add to our long-suffering builder
notificationBuilder.setCustomContentView(remoteView)
create-custom-android-notification
How to create a custom notification on Android

Our example is as simplistic as possible and, of course, a bit jarring. Usually, a customized notification is done in a style similar to the system notification, but in a branded color scheme, like the notifications in Skype, for example.

In addition to .setCustomContentView, which is used for the normal notification, we can separately specify markup for the expanded state with .setCustomBigContentView and for the heads-up state with .setCustomHeadsUpContentView.

Full-screen notifications

Now our custom notification layouts match the design inside the app, but they’re still small notifications, with small buttons. And what happens when you get a normal incoming call? Our eyes are presented with a beautiful screen that takes up all the available space. Fortunately, this functionality is available to us! And we’re not afraid of any limitations associated with RemoteViews, as we can show the full `activity`.

First of all, we have to add a permission to `AndroidManifest.xml`

<uses-permission android:name="android.permission.USE_FULL_SCREEN_INTENT" />

After creating an `activity` with the desired design and functionality, we initialize the PendingIntent and add it to the notification:

val intent = Intent(this, FullscreenNotificationActivity::class.java)

val pendingIntent = PendingIntent.getActivity(applicationContext, 0, intent, PendingIntent.FLAG_UPDATE_CURRENT)

// At the same time we set highPriority to true – what is high priority if not an incoming call?
notificationBuilder.setFullScreenIntent(pendingIntent, true)

Yes, and that’s it! Despite the fact that this functionality is so easy to add, for some reason not all call-related applications use it. However, giants like Whatsapp and Telegram have implemented notifications of incoming calls in this way!

create-full-screen-android-notification
How to create a full screen notification on Android

Bottom line

The incoming call notification on Android is a very important part of the application. There are a lot of requirements: it should be prompt, eye-catching, but not annoying. Today we learned about the tools available to achieve all these goals. Let your notifications be always beautiful!

Read other articles from this cycle:

How to Make Picture-in-Picture Mode on Android With Code Examples

How to Implement Audio Output Switching During the Call on Android App?

How to Implement Foreground Service and Deep Links for Android apps with calls?

Liked this article? Then the next step is learning how to make a video chat on Android. An explanation on how to do it for beginners is here.

Categories
Uncategorized

Video conference and text chat software development

video conference

Major governmental institutions and telecom operators use imind.com. We developed a new version of their video conference and chat platform for businesses. Agencies have meetings there and telecom companies resell it to other enterprises under their brands.

Features

Industries 

Devices 

Technologies 

Costs

Features for video, audio, and text communication software

WebRTC videoconference

We develop for any number of participants:

  • One-on-one video chats
  • Video conferences with an unlimited number of participants

50 live videos on one screen at the same time was the maximum we’ve done. For example, Zoom has 100 live video participants, though it shows 25 live videos on one screen. To see the others, you switch between screens.

Some other functions: custom backgrounds, enlarging videos of particular participants, picking a camera and microphone from the list, muting a camera and microphone, and a video preview of how you look.

Conference recording

Record the whole screen of the conference. Set the time to store recordings on the server. For example, on imind.com we keep videos for 30 days on the free plan and forever on the most advanced one.

The recording does not stop if the person who started it drops off. In Zoom, if the recorder leaves, the recording stops; on imind.com it continues.

Screen sharing and sharing multiple screens simultaneously

Show your screen instead of a video. Choose to show everything or just 1 application – to not show private data accidentally.

Make all video participants share screens at the same time. It helps to compare something. Users don’t have to stop one sharing and start another one. See it in action at imind.com.

Join a conference from a landline phone

For those in the countryside without an Internet connection. Dial a phone number on a wired telephone or your mobile and enter the conference with audio, without a video. SIP technology with Asterisk and FreeSWITCH servers powers this function.

Text chat

Send text messages and emoticons. React with emojis. Send pictures and documents. Go to a private chat with one participant. See a list of participants.

Document editing and signing

Share a document on the conference screen. Scroll through it together, make changes. Sign: upload your signature image or draw it manually. Convenient for remote contract signing in the pandemic.

Polls

Create polls with open and closed questions. View statistics. Make the collective decision-making process faster!

Webinars

In the broadcast mode, display a presentation full-screen to the audience, plus the presenter’s video. Add guest speakers’ videos. Record the whole session to share with participants afterward.

Everlasting rooms with custom links

Create a room and set a custom link to it like videoconference.com/dailymeeting. It’s convenient for regular meetings. Ask participants to add the link to bookmarks and enter at the agreed time each time.

User management

Assign administrators and delegate them the creation of rooms, addition, and deletion of users.

Security

  • One-time codes instead of passwords
  • Host approves guests before they enter the conference
  • See a picture of the guest before approving him
  • Encryption: we enable AES-256 encryption in WebRTC

Custom branding

Change color schemes, use your logo, change backgrounds to corporate images.

Speech-to-text and translation

User speech is recognized and shown on the screen as text. It can be translated and displayed in another language.

Watch videos together online

Watch a movie or a sports game together with friends. Show an employee onboarding video to the new staff members. Chat by video, voice, and text.

Subscription plans

Free plans with basic functionality, advanced ones for pro and business users.

Industries Fora Soft has developed real-time communication tools for

  • Businesses – corporate communication tools
  • Telemedicine – HIPAA-compliant, with EMR, visit scheduling, and payments
  • E-learning – with whiteboards, LMS, teacher reviews, lesson booking, and payments
  • Entertainment: online cinemas, messengers
  • Fitness and training
  • Ecommerce and marketplaces – text chats, demonstrations of goods and services by live video calls

Devices Fora Soft develops for

  • Web browsers
    Chrome, Firefox, Safari, Opera, Edge – applications that require no download
  • Phones and tablets on iOS and Android
    Native applications that you download from the App Store and Google Play
  • Desktop and laptop computers
    Applications that you download and install
  • Smart TVs
    JavaScript applications for Samsung and LG, Kotlin apps for Android-based STBs, Swift apps for Apple TV
  • Virtual reality (VR) headsets
    Meetings in virtual rooms

What technologies to develop a custom video conference on

video chat development technologies

Basic technology to transmit video

Different technologies suit different tasks best:

  • for video chats and conferences – WebRTC
  • for broadcasting to a big audience – HLS
  • for streaming to third-party products like YouTube and Facebook – RTMP
  • for calling to phone numbers – SIP
  • for connecting IP cameras – RTSP and RTP

A freelancer or an agency that does not specialize in video software may pick the technology they are most familiar with. It might not be the best one for your tasks. In the worst case, you’ll have to throw the work away and redo it.

We know all the video technologies well, so we choose what’s best for your goal. If you need several of these features in one project, a mix of these technologies should be used.

WebRTC is the main technology almost always used for video conferences though. This is the technology for media streaming in real-time that works across all browsers and mobile devices people now use. Google, Apple, and Microsoft support and develop it.

WebRTC supports VP8, VP9, and the H.264 Constrained Baseline profile for video, and Opus and G.711 (PCMA and PCMU) for audio. It allows sending video at up to 8,192 x 4,320 pixels – more than 4K. So the limits on WebRTC video stream quality are the end user’s internet speed and device power.

WebRTC video quality is better than in SIP-based video chats, as a study by an Indonesian university shows. See Figure 6, “Video test results,” on page 9 and read the reasoning below it.

Is a media server needed for video conferencing software development?

For video chats with 2-6 participants, we develop p2p solutions. You don’t pay for the heavy video traffic on your servers.

For video conferences with 7 or more people, we use media servers and bridges – Kurento is the first choice.

For “quick and dirty” prototypes we can integrate third-party solutions – ready implementations of video chats with media servers that allow slight customization. 

  • p2p video chats

P2p means video and audio go directly from sender to receivers. Streams do not have to go to a powerful server first. Computers, smartphones, and tablets people use nowadays are powerful enough to handle 2-6 streams without delays.

Many businesses do not need more people in a video conference. Telemedicine usually means just 2 participants: a doctor and a patient. Developing a video chat with a media server would be a mistake here: the business would have to pay for the traffic going through the server without receiving any benefit.

  • Video conferences with a media server

Today’s user devices cannot handle sending more than about 5 outgoing video streams without lag: computers, smartphones, and tablets are not powerful enough, especially since they also receive incoming streams while sending their own video. So for more than 6 people in a video chat, each participant sends just 1 outgoing stream to a media server, and the media server is powerful enough to forward that stream to every participant.

Kurento is our first choice of media servers now for 3 reasons:

  • It is reliable.

    It was one of the first media servers to appear, so it gained the biggest community of developers. The more developers use a technology, the faster issues get solved and the quicker you find answers to questions. This makes development quicker and easier, so you pay less for it.

    Twilio bought Kurento technology for $8.5 million. Now Twilio provides the most reliable paid third-party video chat solution, based on our experience.

    As of 2021, other media servers have smaller developer and contributor communities or are backed by smaller companies, based on our experience and impression. They are either not as reliable as Kurento or do not allow developing as many functions.
  • It allows adding the widest range of custom features.

    From screen sharing to face recognition and more – we have not yet come across a feature a client wanted that could not be developed with Kurento. To give developers this possibility, the Kurento contributors had to develop each one separately and polish it into a well-working solution. Other media servers have not had that much time and those resources to offer the same.
  • It is free.

    Kurento is open-source. It means you may use it in your products legally for free. You don’t have to pay royalties to the technology owner.

We also work with other media servers and bridges – when not that many functions are needed, or when an existing product already uses another media server.

We compare media servers and bridges regularly, as all of them keep developing. Knowing your needs, we recommend the optimal choice.

  • Integration of third-party solutions

Third-party solutions are paid: you pay for minutes of usage. The development of a custom video chat is cheaper in the long run.

Their features are also limited to what their developers developed.

They are quicker to integrate into a working prototype, though. If you need to impress investors, we can integrate them: you get your app quicker and cheaper compared to custom development.

However, to replace it with a custom video chat later – you’ll have to throw away the existing implementation and develop a custom one. So, you’ll pay twice for the video component.

We use the three third-party solutions that have proved the most reliable in our experience.

Write to us: we’ll help you pick the optimal technologies for your video conference.

How much the development of a video conference costs

Since you’re reading this, ready-made solutions sold as-is to integrate into your existing software probably do not suit you, and you need a custom one. The cost of a custom solution depends on its features and their complexity, so we can’t name a price before knowing them.

Take even the login function as an example. A simple one uses just email and password. A complex one may offer login through Facebook, Google, and others, and each option requires extra effort to implement, so the cost may differ several-fold. And login is the simplest function, worth a few work hours. Imagine how much added complexity influences the cost of more complex functions – and you’d probably have quite a lot of them.

Though we can give some indications.

✅ The simplest video chat component takes us 2-4 weeks and costs USD 8000. It is not a fully functioning system with login, subscriptions, booking, etc. – just the video chat with a text chat and screen sharing. You’d integrate it into your website or app and it would receive user info from there. 

✅ The simplest fully functional video chat system takes us about 4-5 months and around USD 56 000. It is built from the ground up for one platform – either web or iOS or Android for example. Users register, pick a plan, and use the system.

✅ A big video conferencing solution is an ongoing development effort. The 1st release takes about 7 months and USD 280 000.

Reach us, let’s discuss your project. After the 1st call, you get an approximate estimation.


How to minimize latency to less than 1 sec for mass streams?

guitar hero

Is it possible to achieve less than a second of latency in a video broadcast? What if the stream goes to a thousand people? Yes. How? Let’s answer using the project WorldCast Live as an example. We did it with WebRTC: WCL streams HD concerts to audiences of hundreds and thousands of people.

Why would I reduce the broadcast latency?

Latency of less than a second is normal in a videoconference – otherwise conversation is nearly impossible. For one-sided streaming, a latency of 2 or even 20 seconds is fine; TV delay, for example, is about 6 seconds. However, there are cases where you’d want to reduce it to 1 second.

  • Sport events

It’s highly unlikely that users will be happy when their neighbors shout “GOAL!” while they still see the ball somewhere in the middle of the field. What if the user is also live betting?

  • Interviews

Thanks to the pandemic, not only talks with friends have moved online, but also interviews with celebrities. Take the interviewer’s latency, add the interviewee’s latency, and add the time it all takes to reach the viewer. The higher the total, the worse the experience for everyone.

  • Concerts

The WCL player is embedded in different sites, and concerts are broadcast to all of them. If you and a friend watch the same concert on different sites with different delays, it hurts the viewing experience, so the concert organizers try to minimize the latency. Although a single viewer wouldn’t really care whether they hear a guitar riff now or in 2 seconds, the general trend is toward minimizing latency. The faster the better!

And also…

The examples above can combine. In WCL, viewers and performers talk via a video chat, so the latency has to be that of a video chat: the minimum possible delay between a question and its answer.

How to reach low latency in live streaming?

Use WebRTC

The standard WebRTC package offers an average latency of 500 ms (half a second). We could’ve finished the article here: creating a video chat where people can connect with each other isn’t difficult. What’s difficult is making it all work stably for thousands of people and customizing the streams to improve their quality.

Base WebRTC isn’t built for HD streaming. The video and audio quality is enough for speaking but isn’t enough for streaming music. To decrease latency, WebRTC may reduce quality and skip parts of the content. To avoid that, you have to go under the hood of WebRTC, which we did.

Set up WebRTC

To ensure that low latency doesn’t come at the expense of quality and the end user’s experience, you need extra development. Here’s what we did for WCL so it can broadcast concerts to thousands of people:

  • Enabled multi-channel audio

Standard WebRTC audio is mono – 1 channel. Stereo is 2 channels. WCL uses 5 channels.

  • Upped bitrate

WebRTC’s default settings limit video transmission to about 500 kb/s. That’s not much for fast action on the screen: when the picture changes quickly under this limit, the encoder has to squeeze the stream into that channel, quality drops, and visible pixelation can appear – not a great watching experience. Therefore, we increased the bitrate to 1.5 Mbps to transmit HD video (a sketch of how such tuning can be done via SDP follows this list).

  • Increased the sampling rate

This improved the quality of audio and video: low and high frequencies are no longer lost. We can’t disclose exactly what we did, but if you’re interested, let us know!
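To give an idea of how such tuning is typically done, here is a minimal Swift sketch that rewrites the SDP before it is applied: it asks Opus for stereo and caps the video bandwidth. This is not WCL’s actual code – the function name, the 1.5 Mbps value, and the simple string matching are illustrative assumptions; real projects often set these limits through the WebRTC sender’s encoding parameters as well.

import Foundation

/// Illustrative SDP tweak: request stereo Opus and cap video bandwidth.
/// Assumption: the Opus fmtp line contains "minptime", as it does in typical WebRTC offers.
func tuneSDP(_ sdp: String, maxVideoKbps: Int = 1500) -> String {
    var lines = sdp.components(separatedBy: "\r\n")

    // 1. Request stereo Opus by extending its fmtp line.
    for (i, line) in lines.enumerated()
    where line.hasPrefix("a=fmtp:") && line.contains("minptime") && !line.contains("stereo=1") {
        lines[i] = line + ";stereo=1;sprop-stereo=1"
    }

    // 2. Add a bandwidth cap (b=AS, in kb/s) after the c= line of the video section.
    if let videoIndex = lines.firstIndex(where: { $0.hasPrefix("m=video") }),
       let connIndex = lines[videoIndex...].firstIndex(where: { $0.hasPrefix("c=") }) {
        lines.insert("b=AS:\(maxVideoKbps)", at: connIndex + 1)
    }

    return lines.joined(separator: "\r\n")
}

The b=AS value is in kb/s, so 1500 corresponds to the 1.5 Mbps mentioned above.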

Scale Kurento

Kurento Media Server is an open-source WebRTC server. We set up a Master Kurento that receives the stream; up to 500 viewers connect directly to the Master Kurento and get the stream from it. If there are more viewers, Edge Kurento servers come into play: the extra viewers connect to them. The more viewers a stream has, the more Edge Kurento servers you need. Together they form a tree-like scheme (a toy estimate of how many Edge nodes you’d need follows the diagram below).

Kurento architecture
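As a rough illustration of this tree, here is a toy Swift estimate of how many Edge Kurento nodes a stream needs. The 500-viewer figure comes from the paragraph above; treating the Master’s and each Edge node’s capacity as equal is our simplifying assumption.

import Foundation

/// Toy estimate: the Master serves up to `capacityPerNode` viewers directly,
/// and each Edge node serves up to `capacityPerNode` more.
func edgeNodesNeeded(viewers: Int, capacityPerNode: Int = 500) -> Int {
    let servedByMaster = min(viewers, capacityPerNode)
    let remaining = viewers - servedByMaster
    return Int(ceil(Double(remaining) / Double(capacityPerNode)))
}

print(edgeNodesNeeded(viewers: 2000))  // 3 – 500 viewers on the Master, 1,500 spread over 3 Edge nodes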

When should you settle for a latency of more than a second?

When the budget is limited. WebRTC is more expensive than HLS if you need scaling. The WebRTC server on WCL costs $0.17/h, which is $122.40 monthly if it runs around the clock for 30 days.

HLS, however, costs $0.023/h and, unlike WebRTC, can be turned on and off. If we take 3 one-hour concerts a week, we’ll spend a bit less than $0.28 for 12 concerts. Note that there are many server providers and prices always differ, but our project shows what the comparison looks like.
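As a back-of-the-envelope check of the two rates above (our assumptions: the WebRTC server runs around the clock, while HLS is billed only during the 12 one-hour concerts):

let webrtcMonthly = 0.17 * 24 * 30   // ≈ $122.40 per month
let hlsMonthly = 0.023 * 12          // ≈ $0.28 per month
print(webrtcMonthly, hlsMonthly)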

The first stable version of subsecond latency on WCL took us 3.5 weeks. If you don’t really need very low latency, why spend on it?

TL;DR

Latency of less than a second in a broadcast is necessary when participants communicate with each other or when things on the stream change all the time. WebRTC makes it possible even for a thousand viewers if you work with the technology.

If your app falls into one of these categories, take a look at WebRTC. And contact us, we’ll help – via our form or a DM on Instagram, which we created not so long ago 🙂


How to estimate the server cost for a video platform

Are you planning on creating video software, such as a video conference, webinar, telemedicine, or e-learning platform? Let’s see how much you’ll have to pay monthly to maintain it.

In this article, we’ll explain how to estimate a video streaming platform’s infrastructure cost. We’ll start with an algorithm and finish with examples. If you don’t have a technical background, you can calculate using the examples.

How much a month of a video platform costs: the calculation algorithm

The combined cost is made up of the cost of traffic, data storage, and the server itself.

  • Traffic – the information that goes through our servers. In general, it’s audio and video streams.
  • Data storing – storing video conference recordings.
  • The server itself is the computer where the project runs. When one server isn’t enough because there are too many users, several servers are used instead.

Traffic costs

Let’s start with the cost of traffic: with video streaming, it’s more expensive than data storage and the server itself.

Let’s calculate how much traffic will go through our server in a certain period of time.

We don’t count incoming traffic, only outgoing. This is because most providers either provide unlimited free incoming traffic, or their prices are so low that they can be neglected.

To calculate the amount of traffic that will be consumed when streaming a video, you need to understand the size of the video file to be streamed. Let’s assume that a file of 1 gigabyte will consume 1 gigabyte of traffic.

You can use the table below.

Traffic use depending on resolution, bandwidth and overall quality

Select a resolution. The right column shows how many gigabytes are consumed per hour when streaming video at that resolution. The numbers were obtained with the help of a calculator; if the table doesn’t have the values you need, take the data from the calculator.

Now that we have an idea of how many gigabytes an hour of video is, we need to calculate how many times the video will be downloaded from the server.

Suppose this is a streaming service and users watch the video on the site. Then, if two people watch the video, the traffic will be spent twice: the video will be watched 1 time by each user.

For services with one streamer and 20 viewers, traffic equal to the size of 20 files will be spent per hour: the video is transferred from the server to each of the 20 viewers.

In the case of a video chat for 4 people, the video size must be multiplied by 12: there are 4 people in the chat, and each receives 3 videos of the other participants from the server.

Multiply the size of the video by the number of times that video is downloaded from the server, and you have the amount of traffic consumed per hour. Multiply it by the price per gigabyte of traffic charged by your provider: AWS, DigitalOcean, or, if the servers are physically in your possession, even your ISP.
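As a tiny Swift sketch of the formula just described (the function name is ours, and 0.11 GB/hour and $0.09/GB are the sample values used in the examples further below):

func hourlyTrafficCost(gbPerHour: Double, outgoingStreams: Int, pricePerGB: Double) -> Double {
    // size of one hour of video × number of outgoing streams × price per GB
    gbPerHour * Double(outgoingStreams) * pricePerGB
}

// Example: 4-person video chat at 640×480 – 12 outgoing streams.
let chatTraffic = hourlyTrafficCost(gbPerHour: 0.11, outgoingStreams: 12, pricePerGB: 0.09)  // ≈ $0.12 per hour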

Voila, we got an approximate price for traffic. But that’s not all: there are still costs for video storage and the cost of renting the server itself.

Price for storing data

If you don’t keep your stream recordings, feel free to skip this section.

Storage costs for data that doesn’t include recorded video are usually so low that they can be neglected in an approximate cost calculation. We rent a server to host the site, and the couple dozen GB of disk space available on it is enough in most cases.

But if the data exceeds a few tens of gigabytes, it’s stored in a separate storage unit. It is paid for in addition.

Examples:

  • An app with a huge audience, like Instagram, needs extra storage even though it doesn’t store video call recordings: users upload a lot of photos and videos, and there are a lot of users.
  • A video conferencing application for a local bank with 3 branches stores conference recordings but doesn’t need additional storage: at any given time, the stored recordings fit within the disk space of the server hosting the site.

First, let’s calculate the size of the files to be stored on the server. The algorithm for calculating the size of one video file is similar to the calculations above, but here it doesn’t matter how many people view the file – all that matters is how many such videos you store.

Examples:

  • If you store a video conference recording for 4 people, you will need space equal to 4 video files – one for each participant’s video.
    You can combine all the videos into one and store it in an optimized form. The size of the files on the server would be 2 or more times smaller. But this feature needs to be developed, so discuss it with your developer.
  • If we store videos like YouTube, one file will be stored for each video. If there is functionality to change video quality depending on the user’s internet speed, one file of each quality will be stored.

Now take the size of one video and multiply it by the number of videos to be stored. Then multiply that value by the price of storing a GB of data on your server.

If it is your own computer, you can directly calculate the cost of HDD/SSD you will need to install there.
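The storage part of the estimate, sketched the same way (the $0.10 per GB-month default mirrors the Amazon example in the calculations below and is only an assumption):

func monthlyStorageCost(gbPerVideo: Double, videosStored: Int, pricePerGBMonth: Double = 0.10) -> Double {
    // size of one stored video × number of videos kept × price per GB-month
    gbPerVideo * Double(videosStored) * pricePerGBMonth
}

// Example: 120 one-hour recordings at 1280×720 (0.88 GB each).
let storage = monthlyStorageCost(gbPerVideo: 0.88, videosStored: 120)  // ≈ $10.56 per month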

All that remains is to calculate the cost of renting or maintaining the server itself.

The cost of the server

The server is where the program itself runs and where user data is stored. Small projects can use 1 server; where there are many users, several servers are used.

The cost of servers strongly depends on the type of project and the number of users, but you can use the tips below as an example.

In the early stages of project development, you can get by with one server for $50 per month.

If there is video conferencing or video streaming, it is advisable to have a separate server for the media server.

If there are many conferences, you can automatically create new servers on the fly for the duration of each video conference and delete them when they’re not needed. Such servers have a price per hour of use. If you know that price for your server, use it; if not, assume about $0.30 per hour for each video conference. This is the average price of the first few Amazon EC2 c5 instances in Amazon’s September 2020 calculator.

Examples of video software monthly cost calculation

#1: Video conference for 9 people

Traffic

Let’s take the resolution 640×480 – the optimal format for video chat (4:3): the picture is well composed and the face is in the middle of the screen.

According to the table in the “Traffic Costs” section above, this is 0.11 GB per hour. This is what 1 hour of video takes one way. A total of 72 streams will be sent from the server – 8 videos for each of 9 users.

Multiply 72 by 0.11 GB, you get 7.92 GB.

Traffic on AWS costs differently depending on the amount of traffic and where clients are physically located. For our approximate calculations, an average cost of $0.09 per GB will do. Multiply 7.92 GB by $0.09 and you get $0.71 per hour.

Storing

Let’s say we want to store video conference recordings. One video will be stored on the server, with all the participants’ videos merged into it.

Let’s assume the new video will have a resolution of 1280×720 (this is enough to see everything) and will take 0.88GB according to the table.

We take General Purpose SSD (gp2) storage from Amazon EBS. It costs about $0.10 per gigabyte per month as of February 2022. Multiply the video size in GB by the cost of storing one GB (0.88 * 0.1), and we get ~$0.088 for each hour of video. This will have to be paid every month that the video is stored on the server.

Server

Here you can take $50 a month for the server for the platform itself and $0.30 for each hour of server rental for the video conference itself.

A server that costs $0.30 per hour can host 5 or more conferences at the same time, but for a pessimistic estimate we assume we have only one video conference.

The total cost is

  • $0.71 per hour for traffic.
  • $0.088 per month for storage of one hour of recordings
  • $50 a month for the server for the site and $0.30 an hour for renting a server for the videoconference itself

$0.71 and $0.30 can be added: both prices are for each hour of an active video conference. You get $1.01 per hour of videoconferencing.

How to calculate how much a month it will cost to maintain such a chat? You need at least an approximate number of videoconferences per day/month.

Let’s say we have 4 hours of video conferences a day. Multiply 4 hours by $1.01, you get the cost per day: $4.04. Multiply the cost of daily service for 30 days, you get the cost per month: $121.20.

To store all the recordings, we would need to store 4 hours * 30 days = 120 hours of video. We will pay $0.088 * 120 = $10.56 a month.

Let’s add up all monthly costs: $121.20 + $10.56 + $50 = $181.76
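Here is the same arithmetic as a small Swift sketch (the variable names are ours; the rates are the assumptions quoted above). Swapping in the numbers from examples #2 and #3 reproduces those totals as well.

// Example #1: video conference for 9 people, 4 hours a day.
let gbPerHour       = 0.11                // 640×480, from the traffic table
let outgoingStreams = 9 * 8               // each of 9 users receives 8 videos
let trafficPerHour  = Double(outgoingStreams) * gbPerHour * 0.09    // ≈ $0.71
let confServerHour  = 0.30                // conference server rental per hour
let hoursPerMonth   = 4.0 * 30            // 4 hours a day, 30 days
let recordingCost   = 0.88 * 0.10 * hoursPerMonth                   // ≈ $10.56 for 120 stored hours
let siteServer      = 50.0
let monthlyTotal    = (trafficPerHour + confServerHour) * hoursPerMonth + recordingCost + siteServer
print(monthlyTotal)  // ≈ $182 – matches the article's $181.76 after its intermediate rounding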

#2: Webinar with 2 streamers and 50 viewers

Traffic

Let’s take 720×480 resolution – a suitable option for streaming: a good balance between quality and economy.

According to the table from the “Traffic costs” section above, it’s 0.44 GB per hour. That’s what 1 hour of video takes one way. There will be 102 video streams sent from the server – 2 videos for each of the 50 viewers and 1 video for each of the 2 streamers.

Multiply 102 by 0.44 GB, you get 44.88 GB.

Traffic on AWS costs differently depending on the amount of traffic and where clients are physically located. For our approximate calculations, an average cost of $0.09 per GB will do. Multiply 44.88 GB by $0.09 and you get $4.04 per hour.

Storing

We have two streamers. The server will store one video with videos of both streamers merged.

Suppose the new video will have a resolution of 1280×720 (this is enough to see everything) and according to the table will take 0.88GB.

We take General Purpose SSD (gp2) storage from Amazon EBS. It costs about $0.10 per gigabyte per month as of February 2022. 0.88 * 0.1 – we’ll pay ~$0.088 for every hour of video for each month it’s stored on our server.

Server

Here we can take $50 for the server for the platform itself and $0.30 for each hour of video streaming.

In fact, a server that costs $0.30 per hour can host more than one streaming session at a time and serve more than 50 viewers, but for a pessimistic calculation we assume we have only one streaming session. We still have to use such a server for it.

Total costs

  • $4.04 for each hour of streaming traffic.
  • $0.088 per month for storage of one hour of records
  • $50 a month for the server for the site and $0.30 per hour of streaming for the streaming server

$4.04 and $0.30 can be added because both prices are per streaming hour. You get $4.34 for every hour of streaming.

How do we calculate how much it will cost per month? We need at least the approximate number of streaming sessions per day or month.

Suppose we have 4 hours of streaming a day. Multiply 4 by $4.34 and you get the cost per day: $17.36. Multiply by 30 and you get the cost per month: $520.80.

To store all the recordings we will need to store 4 hours * 30 days = 120 hours of video. We will pay $0.088 * 120 = $10.56 per month.

Let’s add up all monthly costs: $520.80 + $10.56 + $50 = $581.36.

#3: 1-on-1 video chats via p2p

Traffic

In the case of p2p calls, all video goes directly between users, bypassing the server. Therefore, you don’t have to pay for traffic.

This is true when a p2p connection is technically possible to establish. It is not possible only in about 10% of cases. In those cases the connection will go through the server and not directly. So 90% of the calls will be free and 10% will have to be paid for. Let’s calculate how much exactly.

Let’s take the resolution 640×480 – the optimal format for video chat (4:3): the picture is well composed, with the face in the middle of the screen. According to the table in “Traffic costs” above, this is 0.11 GB per hour. This is what 1 hour of video takes one way.

In total 2 streams will be sent from the server – 1 video for each of 2 users.

Multiply 2 by 0.11 GB, you get 0.22 GB.

Traffic on AWS costs differently depending on the amount of traffic and where clients are physically located. For our approximate calculations, an average cost of $0.09 per GB will do. Multiply 0.22 GB by $0.09 and you get $0.02 per hour.

You pay only for 10% of calls, so on average 1 hour of video chat will cost $0.002.

Storing

We have 2 people in the chat. There will be 1 video stored on the server where the videos of these two people are combined into one.

Let’s assume the new video will have a resolution of 1280×720 (this is enough to see everything) and will take 0.88GB according to the table.

We take General Purpose SSD (gp2) storage from Amazon EBS. It costs about $0.10 per gigabyte per month as of February 2022. 0.88 * 0.1 – we’ll pay ~$0.088 for every hour of video for each month it’s stored on our server.

Server

You can take $50 for the server and skip an additional server for video streaming, as the $50 server will be able to handle more than 1000 simultaneous p2p chats.

Total costs

  • $0.002 per hour of video chat
  • $0.088 per month for storage of one hour of recordings
  • $50 per month for the server for the site

How do I calculate how much it will cost per month? You need at least an approximate number of video chats per day/month.

Let’s suppose we have 100 one-hour chats per day.

Multiply 100 by $0.002 to get the cost per day: $0.2. Multiply by 30 to get the cost per month: $6.

To store all the recordings, you would need to store 100 chats * 30 days = 3000 hours of video. We will pay $0.088 * 3000 = $264 per month.

Add up all the monthly costs and you get $56 without storing any recordings and $320 with all recordings kept on the server.

These calculations are very rough. We give them to provide at least an order-of-magnitude idea of the cost of running a video platform: whether it’s about $10, $1,000, or $10,000 a month. If, after the calculations, your business barely comes out at zero, don’t start it – the margin should be substantial.

Have any more questions? Contact us using the contact form and we’ll get back to you with a response ASAP!