Categories
Uncategorized

How to Implement Silence Trimming Feature to Your iOS App

How to Implement Silence Trimming Feature to Your iOS App

The success of a multimedia platform often depends on the quality of its content. Therefore, there are numerous tools for content management, such as manual or automatic moderation to identify offensive content, copyright infringements, or automatic/manual processing of uploaded content. One of the methods of automatically enhancing content is silence trimming.

The task of isolating voice or sounds is quite challenging. In this article, we will explain how we explored solutions for this challenge during the development of our project, the BlaBlaPlay voice chat, and how to implement it in your iOS application.

[Theory] When do we need silence trimming?

For instance, you might have an audio file with segments where the volume is too low for human perception. In such cases, you may want to remove these segments from the file. Or, when recording an audio message, you might have some silent moments in the beginning while gathering your thoughts. To avoid rewinding through several seconds of silence repeatedly, it’s easier to trim them out. There are many similar cases, but the conclusion is the same—silence trimming improves the content and its perception quality.

We can implement it in different ways using various tools. Initially, we can identify two main groups of tools:

1. Manual removal—any audio editor has a basic function to select a segment for deletion or retention.

2. Automatic removal—these tools use auxiliary technologies to achieve the desired result. Let’s explore them in more detail.

[Theory] Automatic methods of silence detection

There are various methods, and the choice depends on the specific task faced by the developer. This is mainly because some tools allow isolating only the voice from the audio stream, while others work with both voice and background noises.

Detection based on sound level

Detection based on the sound level, or more precisely, its value, is the simplest and quickest method to implement. Therefore, it can be used for real-time audio streaming to identify silence. However, it is also the least accurate and fragile method of silence detection. The technology is straightforward:

1. Set a constant value in decibels, approximately equal to the threshold of human audibility.

2. Anything below this threshold is automatically considered silence and subject to trimming.

oscillogram

This method is suitable only when we need to identify absolute silence, and there’s no need to detect voice or any other background noises. Since absolute silence is rare, this method is not quite effective. Consequently, we can only use it for indicating the presence or absence of sound, not for processing silence.

Isolating voice from the audio stream

This approach works the opposite way—if there is speech, there is no silence.

Extracting sound from an audio stream is a non-trivial task and we can achieve it by evaluating the fragment’s sound levels or its spectrogram—a plot of sound level oscillations. There are two approaches to evaluation: analytical- and neural network-based. In our application, we used a Voice Activity Detector (VAD)—a speech detector to isolate voice from noise or silence. Let’s consider it as an example.

Analytical approach

When working with speech signals, frequency-time domain processing is usually employed. Among the main methods are:

The most effective method for voice extraction relies on the fact that the human speech apparatus can generate specific frequency bands known as “formants.” In this method, the input data consists of a continuous oscillogram (a curve representing oscillations) of the sound wave. To extract speech, it is divided into frames—sound stream fragments with durations ranging from 10 to 20 ms, with a 10 ms step. This size corresponds to the speed of human speech: on average, a person pronounces three words in three seconds, with each word having around four sounds, and each sound is divided into three stages. Each frame is transformed independently and subjected to feature extraction.

Dividing the oscillogram into frames
Dividing the oscillogram into frames

Next, for each window, a Fourier transformation is performed:

  1. Peaks are found.
  2. Based on their formal features, a decision is made: whether there is speech signal or not. For a more detailed process, refer to the work by Lee, 1983, “Automatic Speech Recognition.”

Neural Network Approach to the Assessment

The neural network approach consists of two parts. The so-called feature extractor is a tool for extracting features and building a low-dimensional space. The input to the extractor is an oscillogram of a sound wave, and, for example, using Fourier transformation, its low-dimensional space is constructed. This means that key features are extracted from a large number of features and formed into a new space.

Transforming high-dimensional space to low-dimensional space.
Transforming high-dimensional space to low-dimensional space.

Next, the extractor organizes sounds in space so that similar ones are close together. For example, speech sounds will be grouped together but placed away from sounds of drums and cars.

sounds grouping scheme
Sound grouping scheme

Then, a classification model takes the output data from the feature extractor and calculates the probability of speech among the obtained data.

What to use?

The process of extracting sounds or speech is complex and requires a lot of computational resources for fast operation. Let’s understand when and which method should be applied.

Speech recognition approaches
Speech recognition approaches

So, as we can see, signal-level detection won’t be suitable if you need accuracy. Both analytical and neural network approaches have their nuances. Both require high computational power, which limits their use with streaming audio. However, in the case of the analytical approach, this problem can be addressed by using simpler implementations at the cost of accuracy. For example, WebRTC_VAD may not be highly accurate, but it works quickly with streaming audio, even on low-powered devices.

On the other hand, if you have sufficient computational power and you want to detect not only speech but also sounds like birds, guitars, or anything else, a neural network will solve all your problems with high accuracy and within an acceptable time frame.

[Practice] Example for detecting and trimming silence in iOS audio recordings

All iPhones are sufficiently powerful, and Apple’s frameworks are optimized for these devices. Therefore, we can confidently use neural networks to detect silence in streaming audio, employing a reverse approach: where there are no sounds — there is silence. For detection, we will use the Sound Analysis framework, and for audio recording and trimming — AVFoundation.

Receiving and sending audio buffers during recording

To capture audio buffers from the audio recording stream, we need an AVAudioEngine object. And to deliver the received buffers, we need to add an observer to the output of the connected audio node.

private var audioEngine: AVAudioEngine?

public func start(withSoundDetection: Bool) {
            guard let settings = setupAudioSession() else { return }
            do {
                let configuration = try configureAudioEngine()
                audioEngine = configuration.0
                audioRecordingEvents.onNext(.createsAudioFromat(audioFormat: configuration.1))
            } catch {
                audioRecordingEvents.onError(AudioRecordError.creatingAudioEngineError)
                print("Can not start audio engine")
            }
        configureAudioRecord(settings: settings)
    }


private func configureAudioEngine() throws -> (AVAudioEngine, AVAudioFormat) {
        let audioEngine = AVAudioEngine()
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 4096, format: recordingFormat) { [weak self] buffer, time in
            self?.audioRecordingEvents.onNext(.audioBuffer(buffer, time))
        }
        audioEngine.prepare()
        try audioEngine.start()
        return (audioEngine, recordingFormat)
    }

Processing the buffer in the neural network and obtaining the result

The Sound Analysis framework comes with built-in recognition for 300 sounds, which is more than sufficient for our task. Let’s create a classifier class and properly configure the SNClassifySoundRequest object.

final class AudioClassifire
    private var analyzer: SNAudioStreamAnalyzer?
    private var request: SNClassifySoundRequest?

 init() {
        request = try? SNClassifySoundRequest(classifierIdentifier: .version1)
        request?.windowDuration = CMTimeMakeWithSeconds(1.3, preferredTimescale: 44_100)
        request?.overlapFactor = 0.9
    }
}

When creating the SNClassifySoundRequest, it is crucial to use a non-zero overlapFactor value when using a constant windowDuration. The overlapFactor determines how much the windows overlap during analysis, creating continuous context and coherence between the windows.

Next, we need a class of observer that conforms to the SNResultsObserving protocol. All classification results will be sent to this observer.

enum AudioClassificationEvent {
    case result(SNClassificationResult)
    case complete
    case failure(Error)
}


protocol AudioClassifireObserver: SNResultsObserving {
    var audioClassificationEvent: PublishSubject<AudioClassificationEvent> { get }
}


final class AudioClassifireObserverImpl: NSObject, AudioClassifireObserver {
    private(set) var audioClassificationEvent = PublishSubject<AudioClassificationEvent>()
}


extension AudioClassifireObserverImpl: SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult else { return }
        audioClassificationEvent.onNext(.result(result))
    }
    
    func requestDidComplete(_ request: SNRequest) {
        audioClassificationEvent.onNext(.complete)
    }
    
    func request(_ request: SNRequest, didFailWithError error: Error) {
        audioClassificationEvent.onNext(.failure(error))
    }
}

Once the observer is created, we can create the stream for analyzing incoming audio buffers—SNAudioStreamAnalyzer..

func configureClassification(audioFormat: AVAudioFormat) -> AudioClassifireObserver? {
        guard let request else { return nil}
        let observer = AudioClassifireObserverImpl()
        analyzer = SNAudioStreamAnalyzer(format: audioFormat)
        try? analyzer?.add(request, withObserver: observer)
        self.observer = observer
        return observer
    }

Now everything is ready. We can receive audio buffers and send them for analysis to the neural network, which is straightforward to do:

func addSample(_ sample: AVAudioPCMBuffer, when: AVAudioTime) {
        analysisQueue.async {
            self.analyzer?.analyze(sample, atAudioFramePosition: when.sampleTime)
        }
   }

After the AVAudioPCMBuffer is successfully recognized, an event AudioClassificationEvent.result(SNClassificationResult) will be received in AudioClassifierObserver.audioClassificationEvent. It will contain all recognized sounds and their confidence levels. If there are no sounds, or their confidence is less than 0.75, we can consider that the sound was not recognized, and we can ignore the result. This can be determined as follows:

func detectSounds(_ result: SNClassificationResult) -> [SNClassification] {
        return result.classifications.filter { $0.confidence > 0.75 }
    }

Trimming the audio file based on sound detection

Once the recording starts, and the first audio buffers are sent to the neural network for analysis, we need to start a timer. It will count the time until the first non-zero results appear. Note that the first results will not be obtained earlier than request?.windowDuration = CMTimeMakeWithSeconds(1.3, preferredTimescale: 44_100). So the initial timer values should take this into account.

private var recordSilenceTime: Double = 0.6
private var silenceTimer: DispatchSourceTimer?

private func processAudioClassifireResult(_ result: SNClassificationResult) {
        let results = audioClassifire.detectSounds(result)
        guard !results.isEmpty && !currentState.classificationWasStopped && silenceTimer == nil else {
            startSilenceTimer()
            return
        }
        results.forEach {
            print("Classification result is \($0.description) with confidence: \($0.confidence)")
        }
        stopObservingClassifire()
    }

When the first non-zero results appear, we stop the timer and, based on recordSilenceTime, we can trim a portion from the beginning of the audio recording.

private func processRecordedAudio(fileName: String, filesPath: URL) {
if recordSilenceTime > 0.6,
               let trimmedFile = fileName.components(separatedBy: ".").first {
                let trimmer = AudioTrimmerImpl()
                trimmer.trimAsset(AVURLAsset(url: url), fileName: "\(trimmedFile)", trimTo: recordSilenceTime) { [weak self] url in
                    DispatchQueue.global().asyncAfter(deadline: .now() + 0.2) {
                        let record = AVURLAsset(url: url)
                    }
                }
            }
}

File trimming is done using AVAssetExportSession.

    func trimAsset(_ asset: AVURLAsset, fileName: String, trimTo: Double, completion: @escaping (String) -> Void) {
        let trimmedSoundFileURL = documentsDirectory.appendingPathComponent("\(fileName)-trimmed.mp4")
        do {
            if FileManager.default.fileExists(atPath: trimmedSoundFileURL.absoluteString) {
                try deleteFile(path: trimmedSoundFileURL)
            }
        } catch {
            print("could not remove \(trimmedSoundFileURL)")
        }
        
        print("Export to \(trimmedSoundFileURL)")
        
        if let exporter = AVAssetExportSession(asset: asset, presetName: AVAssetExportPresetPassthrough) {
            exporter.outputFileType = AVFileType.mp4
            exporter.outputURL = trimmedSoundFileURL
            exporter.metadata = asset.metadata
            
            let timescale = asset.duration.timescale
            let startTime = CMTime(seconds: trimTo, preferredTimescale: timescale)
            let stopTime = CMTime(seconds: asset.duration.seconds, preferredTimescale: timescale)
            exporter.timeRange = CMTimeRangeFromTimeToTime(start: startTime, end: stopTime)
            
            exporter.exportAsynchronously(completionHandler: {
                switch exporter.status {
                case  AVAssetExportSession.Status.failed:
                    if let error = exporter.error {
                        print("export failed \(error)")
                    }
                case AVAssetExportSession.Status.cancelled:
                    print("export cancelled \(String(describing: exporter.error))")
                default:
                    print("export complete")
                    completion(fileName)
                }
            })
        } else {
            print("cannot create AVAssetExportSession for asset \(asset)")
        }
    }

Results and Conclusion

Detecting silence or sounds in audio recordings becomes more accessible for applications that are not primarily specialized in professional processing without sacrificing the efficiency and accuracy of the obtained results. Apple already provides ready-made tools for this, so there’s no need to spend a year or more developing such functionality manually. Love neural networks.

Check out how it works in our BlaBlaPlay app or contact us to implement the silence trimming feature in your iOS application.

Categories
Uncategorized

Picture-in-Picture Mode On iOS: Implementation and Peculiarities

pip on ios code examples

The Picture-in-Picture (PiP) mode allows users to watch videos in a floating window on top of other windows. They can move it around the screen and place in any convenient spot. This feature enables users to keep an eye on what they are watching while interacting with other websites or applications.

We have previously covered the PiP implementation on Android with code examples. In this article, we will focus on iOS.

Picture in Picture is a must-have feature for modern multimedia applications

Here’s why:

1. Enhances multitasking. PiP allows users to simultaneously watch videos or view images in a small window while maintaining access to the main content or application interface. This enables users to multitask, e.g. watch a video while checking emails, sending messages, or browsing social media.

2. Improves user experience. The mode provides more flexible and convenient app navigation. This significantly enhances the user experience by eliminating the need to interrupt content playback or switch contexts completely.

3. Minimizes session interruptions. PiP enables users to continue watching or tracking content while performing other tasks. This helps reduce interruptions and ensures a smoother and uninterrupted workflow. For example, a user can watch a tutorial or a YouTube livestream while searching for information on the Internet or taking notes.

All these factors help retain users within the application and increase the duration of app usage sessions.

Peculiarities and difficulties of PiP on iOS

Apple envisages two scenarios for using PiP on iOS:

1. For video content playback

2. For video calls

The main issue is that for the video call scenario, prior to iOS 16 and for iPads that do not support Stage Manager, it is necessary to request special permission from Apple to access the camera in multitasking mode (com.apple.developer.avfoundation.multitasking-camera-access). But even after waiting for several months, as in our case, Apple may still not grant these permissions.

Therefore, in our mobile video chat app, Tunnel Video Calls, we decided not to use such a scenario. Instead, we adopted the approach where a video call and its content are considered as video playback.

PiP lifecycle on iOS

The Picture-in-Picture mode is essentially the content exchange between a full-screen app and PiP content from another app. The lifecycle of this exchange can be schematically represented as follows:

transitioning to pip on ios
Stages of transitioning to the PiP mode on iOS

1. Video is playing in full-screen mode.

2. The user initiates an event that triggers the transition to PiP mode, such as pressing a specific button or minimizing the app.

3. An animation is launched to transition the video to PiP mode — the full-screen video shrinks into a thumbnail and moves to the corner of the screen.

4. The transition process completes, and the application changes its state to the background state.

Then, when it is necessary to bring the video back to full-screen mode from PiP mode, the following steps occur:

1. The app is in the background state and displays PiP.

2. An event occurs that initiates the transition from PiP to full-screen mode and stops the Picture in Picture mode – such as pressing a button or expanding the app. The app enters the foreground.

3. An animation is launched to transition the video to full-screen mode. The app enters the state of displaying the video in full-screen.

Here’s how it looks:

transitioning back from pip on ios
Steps to exit Picture in Picture mode on iOS

Implementing PiP for video playback

To enable PiP, you need to create an AVPictureInPictureController(playerLayer: AVPlayerLayer) object, and it must have a strong reference.

if AVPictureInPictureController.isPictureInPictureSupported() {
    // Create a new controller, passing the reference to the AVPlayerLayer.
    pipController = AVPictureInPictureController(playerLayer: playerLayer)
    pipController.delegate = self
    pipController.canStartPictureInPictureAutomaticallyFromInline = true
}

Next, you need to start playing the video content.

func publishNowPlayingMetadata() {
    nowPlayingSession.nowPlayingInfoCenter.nowPlayingInfo = nowPlayingInfo
    nowPlayingSession.becomeActiveIfPossible()
}

After this, when the button is pressed or the app is minimized/expanded, the PiP will activate:

func togglePictureInPictureMode(_ sender: UIButton) {
    if pipController.isPictureInPictureActive {
        pipController.stopPictureInPicture()
    } else {
        pipController.startPictureInPicture()
    }
}

Implementing PiP for video calls

Implementing Picture in Picture mode in an iOS app for video calls using WebRTC technology is perhaps the most challenging part of the work. We would be happy to help you with it, so please reach out to us to discuss the details. Conceptually:

In this implementation, the camera will not capture the user’s image, and you will only be able to see the conversation partner.

To achieve this, you need to:

1. Create an AVPictureInPictureController object.

2. Obtain the RTCVideoFrame.

3. Retrieve and populate CMSampleBuffer based on RTCVideoFrame.

4. Pass the CMSampleBuffer and display it using AVSampleBufferDisplayLayer.

Here’s a sequence diagram illustrating the process:

sequence diagram pip on ios
PiP on iOS sequence diagram

Categories
Uncategorized

How to Teach Your iOS App Recognize Tone of Voice

speech recognition with neural network on iPhone app

Learn more about neural networks and how they work in general in one of our previous materials with comics. This article is a guide to how to work with them on iOS. And in particular, how to implement speech recognition and characterize what’s recognized. With code examples.

We will explain it with an example /solution that we have developed for one of our projects. Starting with quick intro into the framework we will be using, we’ll then proceed to creating a model, training it with your app data, and analyzing results.

What framework to use: about CoreML

How to determine the phrase toxicity on iOS in real time?

  Phase 1: preparing the data for the speech classification model training

     How to tell everything is ready and works correctly?

  Phase 2: receiving the audio signal and sending it for speech recognition

  Phase 3: speech classification

    How to enhance the results accuracy?

Alternative ways to get MLModel

What framework to use: about CoreML

CML (Core Machine Learning) — is an Apple framework for implementing machine learning to an iOS app. Apple built it in 2016 as a supplement to what they had on working with matrixes and vector algebra (together building up the Accelerate framework) and computing based on the Metal graphic technology — core neural network tools.

Neural network frameworks hierarchy
Neural network frameworks hierarchy: top level uses the results from the bottom layers.

CoreML has nothing to do with neural network training. It is only able to import a ready-made, trained model and provide the developer with a user-friendly interface to work with it in the application. For example, we submit the text to the input of the ML model and get its classification at the output.

Simplified text classifier scheme
Simplified text classifier scheme

For that CoreML integrates a fully trained model, it provides a powerful flexible tool for working with neural networks. It is possible to import almost all popular neural networks:

  • BERT, GPT — for tasks with natural language, the one we speak every day,
  • neural networks for image classification, etc.

There’s just one limitation: the number of tensor components must be <= 5. That is, no more than 5 dimensions.

We should mention what a neural network model is. This is the result of neural network training that contains a weighted graph with the best combination of weights. And gives out a result at the output.

How to determine the phrase toxicity on iOS in real-time?

You can apply the algorithm below to characterizing speech in general. But to exemplify, we’ll focus on the toxicity. 

So, to determine the toxicity of a phrase, you need to divide the problem into several phases:

1. Prepare training data with toxic and non-toxic phrases;

2. Obtain a model of the neural network trained on the data set;

3. Write down the phrase;

4. Send the phrase to the SFSpeechRecognition library for voice analysis and get the phrase in text;

5. Send the text of the trained model to the classification and get the result.

To describe it with a diagram, the problem is as follows:

Speech recognition model scheme
Speech recognition model scheme

Now each phase in detail.

Phase 1: preparing the data for the speech classification model training

To get a trained model you can go two ways:

1. Develop the neural network yourself and train the model;

2. Take a ready-made model and ready-made neural network, train it on your own data set, use Python as a tool, for instance.

To simplify the process, we’ll go the second way. Apple has an excellent set of tools for this, so starting with Xcode 13, the debugging process of the model became as simple as possible.

To begin with, launch the CreateML tool (it’s already in XCode) and create a new project. Select TextClassification (Apple uses BERT) and create a project. You’ll see a window for uploading the prepared data.

On the output, the tool accepts two datasets:

  • the set for the model to complete its learning;
  • the set to compare the results to.

All the data must be in json or csv. The dataset structure should follow the template:

For json:

[
   {
       "text": "The movie was fantastic!",
       "label": "positive"
   }, {
       "text": "Very boring. Fell asleep.",
       "label": "negative"
   }, {
       "text": "It was just OK.",
       "label": "neutral"
   } ...
]

For csv:

text,label
"The movie was fantastic!",positive
"Very boring. Fell asleep.",negative
"It was just OK.",neutral

The data is ready, now you can upload and start training the model:

Starting a new training
Starting a new training

How to tell everything is ready and works correctly?

To evaluate the results, there are reports for each learning project:

Training results report
Training results report

Precision — how well the model identifies the target (in our case the target is the phrase to characterize), with with no false-alarm.

Recall — how correctly the model identifies the target. 

F1 score — an indicator that combines the accuracy and complexity of the algorithm. Here’s how you calculate it:

F1 score formula
F1 score formula

The higher the Precision and Recall, the better. However, in reality, it is impossible to reach the maximum of both indicators at the same time.

All you have left to do is to export the received model, in *.mlmodel.

Phase 2: receiving the audio signal and sending it for speech recognition

On iOS it is the Speech framework translates voice into text. There’s a trained model in it already. Since our main task is to translate speech to text in real time, the first thing to do is to get the samples of the AVAudioPCMBuffer audio signal and them to the recognizer.

class AudioRecordService {
   private var audioEngine: AVAudioEngine?
   func start() {
            do {
                audioEngine = try configureAudioEngine()
            } catch {
                audioRecordingEvents.onNext(.error(.startingAudioEngineError))
            }
    }
    private func configureAudioEngine() throws -> AVAudioEngine {
        let audioEngine = AVAudioEngine()
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            self?.audioRecordingEvents.onNext(.audioBuffer(buffer))
        }
        audioEngine.prepare()
        try audioEngine.start()
        return audioEngine
    }
}

Set a zero-bus branch and the samples will arrive as soon as the audio frame number is 1024. By the way, a AVAudioNode object can potentially have several input and output buses.

Send the received buffer for speech recognition:

Create an enumeration for error processing

enum SpeechReconitionError {
    case nativeError(String)
    case creatingTaskError
}

Create an enumeration for recognition events

enum SpeechReconitionEvents {
    case phrase(result: String, isFinal: Bool)
    case error(SpeechReconitionError)
}

Create a SFSpeechRecognizer object

private var request: SFSpeechAudioBufferRecognitionRequest?
private var reconitionTask: SFSpeechRecognitionTask?
private let recognizer: SFSpeechRecognizer?
 init() {
        recognizer = SFSpeechRecognizer(locale: Locale.preferredLanguages[0])
    }

Configure recognizer and launch the recognition task

    func configureRecognition() {
        request = SFSpeechAudioBufferRecognitionRequest()
        if #available(iOS 16.0, *) {
            request?.addsPunctuation = true
        }
       if let supports = recognizer?.supportsOnDeviceRecognition, supports {
            request?.requiresOnDeviceRecognition = true
        }
        request?.shouldReportPartialResults = true
        guard let request else {
            stopRecognition()
            events.onNext(.error(.creatingTaskError))
            return
        }
  reconitionTask = recognizer?.recognitionTask(with: request, resultHandler: recognitionTaskHandler(result:error:))
    }

The function to add audio buffers to the recognition queue

  func transcribeFromBuffer(buffer: AVAudioPCMBuffer) {
        request?.append(buffer)
    }

Configure the results processor

    private func recognitionTaskHandler(result: SFSpeechRecognitionResult?, error: Error?) {
        if let result = result {
            events.onNext(.phrase(result: result.bestTranscription.formattedString, isFinal: result.isFinal))
            if result.isFinal {
                eraseRecognition()
            }
        }
        
        if let error {
            events.onNext(.error(.nativeError(error.localizedDescription)))
            return
        }
    }
    
    private func eraseRecognition() {
        reconitionTask?.cancel()
        request = nil
        reconitionTask = nil
    }

The recognition process will start immediately after configureRecognition(). Then transfer the resulting audio buffers to transcribeFromBuffer(buffer: AVAudioPCMBuffer). 

The recognition process takes about 0.5-1 seconds. Therefore the result comes asynchronously in the function ecognitionTaskHandler(result: SFSpeechRecognitionResult?, error: Error?). SFSpeechRecognitionResult and contains the results of recognition of the last audio buffer, as well as the results of all previous recognitions. That is, on the screen the user sees the last recognized sentence and everything that was recognized earlier. 

Also, recognition doesn’t always occur directly on the device. When offline recognition is not available, AVAudioPCMBuffer samples are sent to and processed on Apple servers. To verify and enforce the offline mode, use the following command:

 if let supports = recognizer?.supportsOnDeviceRecognition, supports {
            request?.requiresOnDeviceRecognition = true
 }

Apple claims the on-device results are worse. But there are limits for using it online.

Recognition results comparison: server vs on-device
Recognition results comparison: server vs on-device. Source: Apple Tech Talks

Phase 3: speech classification

Note: the main rule to using neural networks for speech classification is the more context there is, the better the accuracy.

First things first, import the ML model to the project as a regular file. Next, create an instance of the model class. The file name will be the class name.

init?() {
        do {
            let config = MLModelConfiguration()
            config.computeUnits = .all
            if #available(iOS 16, *) {
                config.computeUnits = .cpuAndNeuralEngine
            }
            mlModel = try ToxicTextClassificatorConditionalAlgoritm(configuration: MLModelConfiguration()).model
            if let mlModel {
                predicator = try NLModel(mlModel: mlModel)
            }
        } catch {
            print("Can not initilaize ToxicTextClassificatorConditionalAlgoritm")
            return nil
        }
    }

NLModel — is the object you’ll further work with.

Once created, the model is ready to accept input text for classification.

List the possible outcomes of the classification.

enum PredictResult: String {
    case toxic
    case positive
}

Now try to get the result!

 func predictResult(phrase: String) -> PredictResult? {
        guard let predict = predicator?.predictedLabel(for: phrase),
              let result = PredictResult(rawValue: predict) else { return nil }
        return result
   }

We analyze the phrase in real time. This means that the pieces that obtained in the second phase immediately fall into the classification. Because of this, the accuracy of the classification is inevitably lost. 

How to enhance the results accuracy?

а) If there is no punctuation, classify the text as it comes after the recognition and record the result. To do this, write a function that will accept the recognized text and flag that speech recognition is over.

Reminder: the phase will contain more words each time for SFSpeechRecognitionResult returns the recognition results of the last audio buffer recognition along the results of all previous recognitions. 

func analyze(phrase: String, isFinalResult: Bool) {
        guard let predict = predictResult(phrase: phrase) else {
            if isFinalResult, let result = predictResult {
                event.onNext(.finalResult(result))
            }
            return
        }
        predictResult = predict
    }

b) If there’s no punctuation* but you need to reduce the overhead for classification, you can only take the last N words from the sentence. However, this would greatly reduce the accuracy of the classification.

*To add automatic punctuation placement (currently only available in English):

 if #available(iOS 16.0, *) {
            request?.addsPunctuation = true
  }

To improve accuracy and reduce computation overhead, you can use the algorithm to divide text into sentences in proportion. For example, if there are 3 sentences in the text, do 2:1 or 1:2. That is, analyze the first 2 sentences first, then 1 remaining or first 1, then 2. 

Toxicity recognition results
Toxicity recognition results

Note: It’s necessary to request access to the mic and the permission for speech recognition.

Alternative ways to get MLModel

  1. Tool set for Python CoreML tools, that allows converting models trained with other neural networks to mlmodel:
    • CoreMl tools for TensorFlow
    • CoreMl tools for PyTorch
  2. TensorFlow Lite for iOS. It allows working with models trained with TensorFlow.

You can use neural networks for a plethora of different solutions. See how we work with it when developing video surveillance systems.

Categories
Uncategorized

How To Implement Screen Sharing in iOS App using ReplayKit and App Extension

Intro

Screen sharing – capturing user’s display and demonstrating it to peers during a video call. 

There’re 2 ways how you can implement screen sharing into your iOS app:

  1. Screen sharing in app. It suggests that a user can only share their screen from one particular app. If they minimize the app window, broadcasting will stop. It’s quite easy to implement.
  2. Screen sharing with extensions. This approach enables screen sharing from almost any point of the OS: e.g. Homescreen, external apps, system settings. But the implementation might be quite time-consuming.

In this article, we’ll share guides on both.

Screen sharing in app

Starting off easy – how to screen share within an app. We’ll use an Apple Framework, ReplayKit.

import ReplayKit

class ScreenShareViewController: UIViewController {

		lazy var startScreenShareButton: UIButton = {
        let button = UIButton()
        button.setTitle("Start screen share", for: .normal)
        button.setTitleColor(.systemGreen, for: .normal)
        return button
    }()
    
    lazy var stopScreenShareButton: UIButton = {
        let button = UIButton()
        button.setTitle("Stop screen share", for: .normal)
        button.setTitleColor(.systemRed, for: .normal)
        return button
    }()

		lazy var changeBgColorButton: UIButton = {
        let button = UIButton()
        button.setTitle("Change background color", for: .normal)
        button.setTitleColor(.gray, for: .normal)
        return button
    }()
    
    lazy var videoImageView: UIImageView = {
        let imageView = UIImageView()
        imageView.image = UIImage(systemName: "rectangle.slash")
        imageView.contentMode = .scaleAspectFit
        return imageView
    }()
}

Here we added it to the ViewController where recording, background color change buttons and imageView are – this is where the captured video will appear later.

the ViewController

To capture the screen, we address the RPScreenRecorder.shared() class and then call startCapture(handler: completionHandler:).

@objc func startScreenShareButtonTapped() {
		RPScreenRecorder.shared().startCapture { sampleBuffer, sampleBufferType, error in
				self.handleSampleBuffer(sampleBuffer, sampleType: sampleBufferType)
            if let error = error {
                print(error.localizedDescription)
            }
        } completionHandler: { error in
            print(error?.localizedDescription)
        }
}

Then the app asks for a permission to capture the screen: 

the permission pop-up

ReplayKit starts generating a CMSampleBuffer stream for each media type – audio or video. The stream contains the media fragment itself – the captured video – and all necessary information. 

func handleSampleBuffer(_ sampleBuffer: CMSampleBuffer, sampleType: RPSampleBufferType ) {
        switch sampleType {
        case .video:
            handleVideoFrame(sampleBuffer: sampleBuffer)
        case .audioApp:
//             handle audio app
            break
        case .audioMic:
//             handle audio mic
            break
        }
    }

The function converted into the UIImage type will then process each generated videoshot and display it on the screen.

func handleVideoFrame(sampleBuffer: CMSampleBuffer) {
        let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)!
        let ciimage = CIImage(cvPixelBuffer: imageBuffer)
        
        let context = CIContext(options: nil)
        var cgImage = context.createCGImage(ciimage, from: ciimage.extent)!
        let image = UIImage(cgImage: cgImage)
        render(image: image)
}

Here’s what it looks like:

generated frames

Captured screen broadcasting in WebRTC 

Usual setting: during a video call one peer wants to demonstrate the other what’s up on their screen. WebRTC is a great pick for it.

WebRTC connects 2 clients to deliver video data without any additional servers – it’s peer-to-peer connection (p2p). Check out this article to learn about it in detail. 

Data streams that clients exchange are media streams that contain audio and video streams. A video stream might be a camera image or a screen image.

To establish p2p connection successfully, configure a local mediastream that will further be delivered to the session descriptor. To do that, get an object of the RTCPeerConnectionFactory class and add the media stream packed with audio and video tracks to it.

func start(peerConnectionFactory: RTCPeerConnectionFactory) {

        self.peerConnectionFactory = peerConnectionFactory
        if self.localMediaStream != nil {
            self.startBroadcast()
        } else {
            let streamLabel = UUID().uuidString.replacingOccurrences(of: "-", with: "")
            self.localMediaStream = peerConnectionFactory.mediaStream(withStreamId: "\\(streamLabel)")
            
            let audioTrack = peerConnectionFactory.audioTrack(withTrackId: "\\(streamLabel)a0")
            self.localMediaStream?.addAudioTrack(audioTrack)

            self.videoSource = peerConnectionFactory.videoSource()
            self.screenVideoCapturer = RTCVideoCapturer(delegate: videoSource!)
            self.startBroadcast()
            
            self.localVideoTrack = peerConnectionFactory.videoTrack(with: videoSource!, trackId: "\\(streamLabel)v0")
            if let videoTrack = self.localVideoTrack  {
                self.localMediaStream?.addVideoTrack(videoTrack)
            }
            self.configureScreenCapturerPreview()
        }
    }

Pay attention to the video track configuration:

func handleSampleBuffer(sampleBuffer: CMSampleBuffer, type: RPSampleBufferType) {
        if type == .video {
            guard let videoSource = videoSource,
                  let screenVideoCapturer = screenVideoCapturer,
                  let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
            
            let width = CVPixelBufferGetWidth(pixelBuffer)
            let height = CVPixelBufferGetHeight(pixelBuffer)
            videoSource.adaptOutputFormat(toWidth: Int32(width), height: Int32(height), fps: 24)
            
            let rtcpixelBuffer = RTCCVPixelBuffer(pixelBuffer: pixelBuffer)
            let timestamp = NSDate().timeIntervalSince1970 * 1000 * 1000
            
            let videoFrame = RTCVideoFrame(buffer: rtcpixelBuffer, rotation: RTCVideoRotation._0, timeStampNs: Int64(timestamp))
            videoSource.capturer(screenVideoCapturer, didCapture: videoFrame)
        }
}

Screen sharing with App Extension

Since iOS is a quite closed and highly protected OS, it’s not that easy to address storage space outside an app. To let developers access certain features outside an app, Apple created App Extensions – external apps with access to certain relationships in iOS. They operate according to their types. App Extensions and the main app (let’s call it Containing app) don’t interact with each other directly but can share a data storing container. To ensure that, create an AppGroup on the Apple Developer website, then link the group with the Containing App and App Extension. 

containing app and extension relation
Scheme of data exchange between entities

Now to devising the App Extension. Create a new Target and select Broadcast Upload Extension. It has access to the recording stream and its further processing. Create and set up the App Group between targets. Now you can see the created folder with App Extension. There’re Info.plist, the extension file, and the swift SampleHandler file. There’s also a class with the same name written in SampleHandler that the recorded stream will process. 

The methods we can operate with are already written in this class as well: 

override func broadcastStarted(withSetupInfo setupInfo: [String : NSObject]?)
override func broadcastPaused() 
override func broadcastResumed() 
override func broadcastFinished()
override func processSampleBuffer(_ sampleBuffer: CMSampleBuffer, with sampleBufferType: RPSampleBufferType)

We know what they’re responsible for by their names. All of them, but the last one. It’s where the last CMSampleBuffer and its type arrives. In case the buffer type is .video, this is where the last shot will be.

Now let’s get to implementing screen sharing with launching iOS Broadcast. To start off, we demonstrate the RPSystemBroadcastPickerView itself and set the extension to call.

let frame = CGRect(x: 0, y: 0, width: 60, height: 60)
let systemBroadcastPicker = RPSystemBroadcastPickerView(frame: frame)
systemBroadcastPicker.autoresizingMask = [.flexibleTopMargin, .flexibleRightMargin]
if let url = Bundle.main.url(forResource: "<OurName>BroadcastExtension", withExtension: "appex", subdirectory: "PlugIns") {
    if let bundle = Bundle(url: url) {
           systemBroadcastPicker.preferredExtension = bundle.bundleIdentifier
     }
}
view.addSubview(systemBroadcastPicker)

Once a user taps “Start broadcast” the broadcast will start and the selected extension will process the state and the stream itself. But how will the Containing App know this? Since the storage container is shared, we can exchange data via the file system – e.g. UserDefaults(suiteName) and FileManager. With it we can set up a timer, check up the states within certain periods of time, record and read data along a certain track. An alternative to that is to launch a local web-socket server and address it. But in this article we’ll only cover exchanging via files.

Write the BroadcastStatusManagerImpl class that will record current broadcast status as well as communicate to the delegate on status alterations. We’ll check on the updated info using a timer with 0.5 sec frequency. 

protocol BroadcastStatusSubscriber: AnyObject {
    func onChange(status: Bool)
}

protocol BroadcastStatusManager: AnyObject {
    func start()
    func stop()
    func subscribe(_ subscriber: BroadcastStatusSubscriber)
}

final class BroadcastStatusManagerImpl: BroadcastStatusManager {

    // MARK: Private properties

    private let suiteName = "group.com.<YourOrganizationName>.<>"
    private let forKey = "broadcastIsActive"

    private weak var subscriber: BroadcastStatusSubscriber?
    private var isActiveTimer: DispatchTimer?
    private var isActive = false

    deinit {
        isActiveTimer = nil
    }

    // MARK: Public methods

    func start() {
        setStatus(true)
    }

    func stop() {
        setStatus(false)
    }

    func subscribe(_ subscriber: BroadcastStatusSubscriber) {
        self.subscriber = subscriber
        isActive = getStatus()

        isActiveTimer = DispatchTimer(timeout: 0.5, repeat: true, completion: { [weak self] in
            guard let self = self else { return }

            let newStatus = self.getStatus()

            guard self.isActive != newStatus else { return }

            self.isActive = newStatus
            self.subscriber?.onChange(status: newStatus)
        }, queue: DispatchQueue.main)

        isActiveTimer?.start()
    }

    // MARK: Private methods

    private func setStatus(_ status: Bool) {
        UserDefaults(suiteName: suiteName)?.set(status, forKey: forKey)
    }

    private func getStatus() -> Bool {
        UserDefaults(suiteName: suiteName)?.bool(forKey: forKey) ?? false
    }
}

Now we create samples of BroadcastStatusManagerImpl to the App Extension and the Containing App, so that they know the broadcast state and record it. The Containing App can’t stop the broadcast directly. This is why we subscribe to the state – this way, when it reports false, App Extension will terminate broadcasting, using the finishBroadcastWithError method. Even though, in fact, we end it with no error, this is the only method that Apple SDK provides for program broadcast termination. 

extension SampleHandler: BroadcastStatusSubscriber {
    func onChange(status: Bool) {
        if status == false {
            finishBroadcastWithError(NSError(domain: "<YourName>BroadcastExtension", code: 1, userInfo: [
                NSLocalizedDescriptionKey: "Broadcast completed"
            ]))
        }
    }
}

Now both apps know when the broadcast started and ended. Then, we need to deliver data from the last shot. To do that, we create the PixelBufferSerializer class where we declare the serializing and deserializing methods. In the SampleHandler’s processSampleBuffer method we convert CMSampleBuffer to CVPixelBuffer and then serialize it to Data. When serializing to Data it’s important to record the format type, height, width and processing increment for each surface in it. In this particular case we have two of them: luminance and chrominance, and their data. To get the buffer data, use CVPixelBuffer-kind functions.

While testing from iOS to Android we’ve faced this problem: the device just wouldn’t display the screen shared. It’s that Android OS doesn’t support the irregular resolution the video had. We’ve solved it by just turning it into 1080×720. 

Once having serialized into Data, record the link to the bytes gained into the file.

memcpy(mappedFile.memory, baseAddress, data.count)

Then create the BroadcastBufferContext class in the Containing App. Its operation logic is alike BroadcastStatusManagerImpl: the file discerns each timer iteration and reports on the data for further processing. The stream itself comes in 60 FPS, but it’s better to read it with 30 FPS, since the system doesn’t perform well when processing in 60 FPS due to lack of the resource. 

func subscribe(_ subscriber: BroadcastBufferContextSubscriber) {
        self.subscriber = subscriber

        framePollTimer = DispatchTimer(timeout: 1.0 / 30.0, repeat: true, completion: { [weak self] in
            guard let mappedFile = self?.mappedFile else {
                return
            }

            var orientationValue: Int32 = 0
            mappedFile.read(at: 0 ..< 4, to: &orientationValue)
            self?.subscriber?.newFrame(Data(
                bytesNoCopy: mappedFile.memory.advanced(by: 4),
                count: mappedFile.size - 4,
                deallocator: .none
            ))
        }, queue: DispatchQueue.main)
        framePollTimer?.start()
    }

Deserialize it all back to CVPixelBuffer, likewise we serialized it but in reverse. Then we configure the video track by setting up the extension and FPS.

videoSource.adaptOutputFormat(toWidth: Int32(width), height: Int32(height), fps: 60)

Now add the frame RTCVideoFrame(buffer: rtcpixelBuffer, rotation: RTCVideoRotation._0, timeStampNs: Int64(timestamp)). This track goes to the local stream.

localMediaStream.addVideoTrack(videoTrack)

Conclusion 

Implementing screen sharing in iOS is not that easy as it may seem. Reservedness and security of the OS force developers into looking for workarounds to deal with such tasks. We’ve found some – check out the result in our Fora Soft Video Calls app. Download on AppStore.

Categories
Uncategorized

Video conference and text chat software development

video conference

The video conferencing market volume is $4.66 billion (TrueList). The global video conferencing market size is projected to grow to $22.5 billion by 2026 according to Video Conferencing Statistics 2022

We specialise in developing video and multimedia software and app. We have been doing it since 2005. Along the way we created a lot of messengers like Speakk and conferencing system like ProVideoMeeting. Freelancer or an agency that does not specialize in video software may pick the technology they are best familiar with. We will build custom product tailored for your needs.

Features

Industries 

Devices 

Technologies 

Costs

Features for video, audio, and text communication software

WebRTC videoconference

We develop for any number of participants:

  • One-on-one video chats
  • Video conferences with an unlimited number of participants

50 live videos on one screen at the same time was the maximum we’ve done. For example, Zoom has 100 live video participants, though it shows 25 live videos on one screen. To see the others, you switch between screens.

Some other functions: custom backgrounds, enlarging videos of particular participants, picking a camera and microphone from the list, muting a camera and microphone, and a video preview of how you look.

Conference recording

Record the whole screen of the conference. Set the time to store recordings on the server. For example, on imind.com we keep videos for 30 days on a free plan forever on the most advanced one.

Do not interrupt the recording if the recorder dropped off. In Zoom, if the recorder leaves, the recording stops. In imind.com it continues.

Screen sharing and sharing multiple screens simultaneously

Show your screen instead of a video. Choose to show everything or just 1 application – to not show private data accidentally.

Make all video participants share screens at the same time. It helps to compare something. Users don’t have to stop one sharing and start another one. See it in action at imind.com.

Join a conference from a landline phone

For those in the countryside without an Internet connection. Dial a phone number on a wired telephone or your mobile and enter the conference with audio, without a video. SIP technology with Asterisk and FreeSWITCH servers powers this function.

Text chat

Send text messages and emoticons. React with emojis. Send pictures and documents. Go to a private chat with one participant. See a list of participants.

Document editing and signing

Share a document on the conference screen. Scroll through it together, make changes. Sign: upload your signature image or draw it manually. Convenient for remote contract signing in the pandemic.

Polls

Create polls with open and closed questions. View statistics. Make the collective decision-making process faster!

Webinars

In the broadcast mode, display a presentation full-screen to the audience, plus the presenter’s video. Add guest speakers’ videos. Record the whole session to share with participants afterward.

Everlasting rooms with custom links

Create a room and set a custom link to it like videoconference.com/dailymeeting. It’s convenient for regular meetings. Ask participants to add the link to bookmarks and enter at the agreed time each time.

User management

Assign administrators and delegate them the creation of rooms, addition, and deletion of users.

Security

  • One-time codes instead of passwords
  • Host approves guests before they enter the conference
  • See a picture of the guest before approving him
  • Encryption: we enable AES-256 encryption in WebRTC

Custom branding

Change color schemes, use your logo, change backgrounds to corporate images.

Speech-to-text and translation

User speech is recognized and shown on the screen. It can be in another language for translation.

Watch videos together online

Watch a movie or a sports game together with friends. Show an employee onboarding video to the new staff members. Chat by video, voice, and text.

Subscription plans

Free plans with basic functionality, advanced ones for pro and business users.

Industries we developed real-time communication tools for

  • Businesses – corporate communication tools
  • Telemedicine – HIPAA-compliant, with EMR, visit scheduling, and payments
  • E-learning – with whiteboards, LMS, teacher reviews, lesson booking, and payments
  • Entertainment: online cinemas, messengers
  • Fitness and training
  • Ecommerce and marketplaces – text chats, demonstrations of goods and services by live video calls

Devices we develop for

  • Web browsers
    Chrome, Firefox, Safari, Opera, Edge – applications that require no download
  • Phones and tablets on iOS and Android
    Native applications that you download from AppStore and Google Play
  • Desktop and laptop computers
    Applications that you download and install
  • Smart TVs
    Javascript applications for Samsung and LG, Kotlin apps for Android-based STBs, Swift apps for Apple TV
  • Virtual reality (VR) headsets
    Meetings in virtual rooms

What technologies to choose

video chat development technologies
Technologies for video chat development

Basic technology to transmit video

Different technologies suit best for different tasks:

  • for video chats and conferences – WebRTC
  • for broadcasting to a big audience – HLS
  • for streaming to third-party products like YouTube and Facebook – RTMP
  • for calling to phone numbers – SIP
  • for connecting IP cameras – RTSP and RTP

Freelancer or an agency that does not specialize in video software may pick the technology they are best familiar with. It might be not the best for your tasks. In the worst case, you’ll have to throw the work away and redo it. 

We know all the video technologies well. So we choose what’s best for your goal. If you need several of these features in one project – a mix of these technologies should be used. 

WebRTC is the main technology almost always used for video conferences though. This is the technology for media streaming in real-time that works across all browsers and mobile devices people now use. Google, Apple, and Microsoft support and develop it.

WebRTC supports VP8, VP9 and H264 Constrained Baseline profile for video and OPUS, G.711 (PCMA and PCMU) for audio. It allows sending video up to 8,192 x 4,320 pixels – more than 4K. So the limitations to video stream quality on WebRTC are the internet speed and device power of the end-user. 

WebRTC video quality is better than in SIP-based video chats, as a study of an Indonesian university shows. See Figure 6 on page 9: Video test results and read the reasoning below it.

Is a media server needed for video conferencing software development?

For video chats with 2-6 participants, we develop p2p solutions. You don’t pay for the heavy video traffic on your servers.

For video conferences with 7 and more people, we use media servers and bridges – Kurento is the 1st choice. 

For “quick and dirty” prototypes we can integrate third-party solutions – ready implementations of video chats with media servers that allow slight customization. 

  • p2p video chats

P2p means video and audio go directly from sender to receivers. Streams do not have to go to a powerful server first. Computers, smartphones, and tablets people use nowadays are powerful enough to handle 2-6 streams without delays.

Many businesses do not need more people in a video conference. Telemedicine usually means just 2 participants: a doctor and a patient. The development of a video chat with a media server is a mistake here. Businesses would have to pay for the traffic going through the server not receiving any benefit.

  • Video conferences with a media server

Users cannot handle sending more than 5 outgoing video streams without lags now. People’s computers, smartphones, and tablets are not powerful enough. While sending their own video, they accept incoming streams. So for more than 6 people in video chat – each sends just 1 outgoing stream to a media server. The media server is powerful enough to send this stream to each participant.

Kurento is our first choice of media servers now for 3 reasons:

  • It is reliable.

    It was one of the first media servers to appear. So it gained the biggest community of developers. The more developers use technology the faster they solve issues, the quicker you find the answers to questions. This makes development quicker and easier, so you pay less for it.

    Twilio bought Kurento technology for $8.5 million. Now Twilio provides the most reliable paid third-party video chat solution, based on our experience.

    In 2021, other media servers have smaller developers’ and contributors’ communities or are backed by not-so-big companies, based on our experience and impression. They either are not as reliable as Kurento or do not allow developing that many functions.
  • It allows adding the widest number of custom features.

    From screen sharing to face recognition and more – we have not faced any feature that our client would want, not possible to develop with Kurento. To give developers this possibility, the Kurento contributors had to develop each one separately and polish it to a well-working solution. Other media servers did not have that much time and resources to offer the same.
  • It is free.

    Kurento is open-source. It means you may use it in your products legally for free. You don’t have to pay royalties to the technology owner.

We work with other media servers and bridges – when not that many functions are needed, or it is an existing product already using another media server:

We compare media servers and bridges regularly as all of them develop. Knowing your needs, we recommend the optimal choice.

  • Integration of third-party solutions

Third-party solutions are paid: you pay for minutes of usage. The development of a custom video chat is cheaper in the long run.

Their features are also limited to what their developers developed.

They are quicker to integrate and get a working prototype though. If you need to impress investors – we can integrate them. You get your app quicker and cheaper compared to the custom development.

However, to replace it with a custom video chat later – you’ll have to throw away the existing implementation and develop a custom one. So, you’ll pay twice for the video component.

We use these 3 -they are the most reliable ones based on our experience:

Write to us: we’ll help to pick optimal technologies for your video conference.

How much the development of a video conference costs

You’re here – means ready solutions sold as is to integrate into your existing software probably do not suit you and you need a custom one. The cost of a custom one depends on features and their complexity. So we can’t say the price before knowing these.

Take even the log in function as an example. A simple one is just through email and password. A complex one may have a login through Facebook, Google, and others. Each way requires extra effort to implement. So the cost may differ several times. And login is the simplest function for a few work hours. Imagine how much the added complexity will influence the cost of more complex functions. And you’d probably have quite a lot of functions.

Though we can give some indications.

✅ The simplest video chat component takes us 2-4 weeks and costs USD 8000. It is not a fully functioning system with login, subscriptions, booking, etc. – just the video chat with a text chat and screen sharing. You’d integrate it into your website or app and it would receive user info from there. 

✅ The simplest fully functional video chat system takes us about 4-5 months and around USD 56 000. It is built from the ground up for one platform – either web or iOS or Android for example. Users register, pick a plan, and use the system.

✅ A big video conferencing solution development is an ongoing work. The 1st release takes about 7 months and USD 280 000.
Reach us, let’s discuss your project. After the 1st call, you get an approximate estimation.

Categories
Uncategorized

iOS Automated Testing – Who Needs it and Who Don’t + Best Tools [2023]

We at Fora Soft have been using iOS automating testing for some of our projects. Learn from our expertise what mobile automated testing is all about, when you need it and when you don’t. Don’t forget to check out the list of the most popular iOS automated testing tools below. 

What is iOS testing automation? Pros & Cons

When do you need automated testing?

When you don’t need automated testing?

What are the best automated testing tools in iOS?

In recent years, more organizations have begun developing native app versions of their web applications. With the number of mobile apps downloads increasing, the brands will have to focus more on testing automatically to speed up a new app version release process. Time is money and this way you can stay ahead of the competition. 

As per the MarketsAndMarkets report, the global automation testing market size is expected to grow from USD 24.7 billion in 2022 to USD 52.7 billion by 2027, at a CAGR of 16.4%. To keep up with the pace of the ever-changing testing landscape, you should be familiar with the latest news on iOS testing automation. 

What is iOS testing automation?

Automated testing is imitating real user action according to a certain scenario. For example subscription purchasing, data inserting, button tapping, etc. All done automatically using special software and with no human participation. You can automate testing of any product, including an iOS app.

Advantages

  • Concurrent testing on multiple devices. That means, more tests can be run per build/release. For instance, automated regression testing that ensures a new feature or a change in the code doesn’t affect the overall system performance. This results in a better quality of released products = better user experience.
  • Faster testing process. You save time and money for better product quality.
  • Minimized human error. Automated testing ensures thorough testing of the same areas over and over again. Going through the same feature might be monotonous to a human and result in them missing bugs and errors.
  • Testing transparency. When the testers swap, old scripts stay, and the testing process will continue as intended. Regression testing stays the same, too. If you need to change something, or if a new tester wants to check the application logic, the script will work as documentation. This is one of the main advantages.

Disadvantages:

  • When iOS updates, you have to wait for the automated testing tools to update, too.
  • The initial development of automated tests can be time-consuming and costly (but it’s worth it in the long run).

When do you need automated testing?

Do you even need it? The answer is “yes”, if:

  • Your app has too many functions and features, and you are going to support it while adding new things along the way. Why do you need auto testing then? New functionality can conflict with the old one. For example, you’ve introduced a chat and calls broke. To find out the problem, the tester has to test the whole app manually. It takes a lot of time, and the tester is at risk of missing something else! Auto testing helps avoid that problem, as the tester won’t have to test everything after introducing each feature. Whatever stayed the same will be checked automatically, as long as you launch the tests and then collect the results. It helps reduce the testing time and the development costs.
  • You are going to adapt the app for each new iOS version and take advantage of new system possibilities. Every iOS update can break something within the app. Even if you never planned to update the app in the near future, it might be that you have to. In that case, automated testing will help with that. After running the tests, you’ll understand what it was that broke and be able to solve the problem. Obviously, you wouldn’t add automated testing with this sole purpose, but they will help a great deal if they are already implemented.
  • There are testers on your team that possess some knowledge of mobile automated testing. They at least have to know some popular programming language if you go for, say, Appium. Or, they have to know Swift if your choice is XCTest / EarlGrey / KIF. Testers also need to know how to work with mobile testing tools. If your employees only know how to manually test apps and have no knowledge of programming languages whatsoever, you will either have to teach them or hire new ones.

When you don’t need automated testing?

However, writing automated tests is programming, although you’re not writing new functions for your app but rather a program that goes through your product and checks it. It is expensive. It won’t be worth it to add automated tests if:

  • The app is small. It doesn’t have lots of functions and it is easy to test it manually. With that being said, you are not planning on adding new functions on a constant basis.
  • The app is supposed to be developed and distributed within a short period of time, such as those for the World Cup-2018 or the Olympics-2014.
  • The app changes frequently. The functionality is unstable. For instance, a startup that looks for its client and keeps changing the main features.

What are the best automated testing tools in iOS?

automated testing tools comparison
Automated testing tools comparison

XCUITest/XCTest

Apple developed a fully native tool that is out there only for testing iOS apps. Since it is native, external dependencies won’t be there. You develop tests in Swift or Objective-C, which helps developers and testers interact more effectively. However, developing in those languages isn’t that simple. It might be that testers will turn to developers to ask for help far too often, which will make work a bit chaotic. 

There is a test recorder, too. It records real actions with an app and creates a test out of them, but using it is actually quite hard. It isn’t very accurate and it’s best to use it as an assisting tool while developing main tests in Swift or Objective-C. XCUITest/XCTest also works in a separate thread, and doesn’t read the state of an app. Therefore, delays in updating the data may lead to a failure to find requested elements.

EarlGrey

The Framework by Google. It requires tests in Objective-C or Swift. The framework synchronizes requests, UI, and streams – that’s the advantage of it. However, EarlGrey isn’t very popular because you can only test iOS apps with it. It isn’t very different from XCUITest, yet it is not native, so testers would rather use XCUITest.

KIF

KIF is a framework that has to be added to the project to use it. Objective-C or Swift are the testing languages. Its realism is its main competitive edge. KIF can simulate interaction with a user, therefore it’s very good for UI testing.

You see the iOS-only tools above but when mobile development is in question, oftentimes the developers go for both iOS and Android apps. So no wonder there are cross-platform tools for automated testing.

Detox

JavaScript is a language for tests with Detox. It can access the memory and control ongoing processes. The framework works with emulators and real devices and uses native methods right on the device. It also uses EarlGrey and is considered to be really good at testing apps written in React Native, because React Native uses JavaScript, just like Detox. It allows for writing the same tests for Android and iOS.

Appium

Appium is the most popular tool nowadays. It allows testing apps regardless of the platform, type, and system version. Writing tests for each platform is possible using a unified API, without adapting the app to work with a specific testing framework. Appium doesn’t require adding to the app source code. It works as a separate tool. Let’s take a look at its advantages:

  • A large choice of languages for tests: Java, C#, Python, Ruby, JavaScript. It means that Appium doesn’t only work with Objective-C or Swift, so testers with knowledge of any of the supported languages will be able to write tests. An app doesn’t need re-compiling or changing it for automation’s sake. It’s important because the test source code and the app source code aren’t in the same project, and they are developed separately. These two don’t depend on each other, so one can avoid many problems. For example, if somebody wrote the tests incorrectly and they don’t compile, it won’t affect the app in general.
  • It is cross-platform. The testers can develop tests for iOS and Android in the same environment, in the same language. They can even re-use the code. It saves time and money.
  • Wide functionality. You can launch and stop the app, check the visibility of elements on the screen, and use gestures. Simulators and real devices work with Appium.

Appium has some disadvantages, too. It is essentially a superstructure over native iOS and Android drivers. The tests can break more often due to the mistakes in the superstructure code. It’s important to note here that Appium is very popular, develops quickly, so arising problems will likely be solved in the future.

Conclusion

So, a quick summary.

You should consider automating your testing process if:

  • your app is complex and has many features
  • you don’t dramatically change the features too often
  • you want to save time on testing basic functions like login, payment, booking, etc. 

Tools and methods:

We’ve covered the most popular iOS automated testing tools in the sheet above. Whenever you choose a tool, consider the application’s peculiarities, check if there will be an Android version of the app. Also, consider your testing team and their preferences! 🙂 Want to automate your mobile app testing process and stay ahead of the competition? Contact us! We will reply shortly to discuss details and provide a time-money estimation based on that.

Categories
Uncategorized

What iOS developer should know. The hiring process at Fora Soft

In this article, we will explain how Fora Soft hires people. What an iOS developer should know and what interview stages they complete. Besides, you will also find out about how we carry out the mentoring process and how programmers progress.

Minimal hiring requirements

A developer should know:

  • OOP. This is a programming methodology based on representing a program as a combination of objects
  • Swift, Obj-C (reading the code). Swift is a programming language developed by Apple, and they use it to write programs for all their products. Obj-C is a previous language that Apple used for the apps. Even now one still can create objects with it but people mostly use Swift. However, one needs to at least read Obj-C, as older projects are written in it, and they need support
  • iOS SDK. Knowledge of main frameworks, such as UIKit (work with graphic representation), Foundation (work with network and date), AVKit (work with media), MapKit (work with maps), CoreLocation  (work with geolocation)
  • Apple Guideline. For an app to be approved for AppStore, it has to answer Apple’s requirements. Read Apple documentation to find out more
  • AutoLayout. This is a mechanism for page-making in the app, it is responsible for placing interface elements on the screen
  • Multithreading. It’s important for an app to complete many processes simultaneously. For example, to make a request into a network and show the data loader
  • SOA (REST API, Web Socket). To work with the network, one must understand its organization
  • Git. The version control system is out there to make the project work easier and make it possible to put a group effort into it. Besides, it allows keeping several versions of the same document. It’s also possible to return to earlier versions, determine who and when made a change, etc.

The selection and recruitment process

It consists of

  1. A candidate sends their CV to an HR
  2.  HR looks at the CV and calls the candidate if the CV meets the requirements
  3. HR speaks about the company and answers questions. Then an interview happens, where HR tests the candidate’s professional qualifications. The questions are related to the specific position.
  4. If the candidate answers the questions successfully, he is invited to join the next stage – a technical interview with an HR and team lead. The lead does the talking here. He asks not only about the job itself, but about the IT world as well. It’s important to know how well-rounded the candidate is. The HR then conducts an office tour
  5. We invite almost all candidates who passed Step 4 to complete a test assignment within a week
  6. The candidate sends the assignment, which the team lead reviews and sends feedback to the HR. Here the decision whether we invite the candidate to the final interview arises
  7. The CEO attends the final interview. The candidate receives some specific assignment, for example, developing a video call system. The candidate has to explain how they’d like to proceed with the task
  8. We send an offer

When I find an employee who turns out to be wrong for the job, I feel it is my fault because I made the decision to hire him.

Akio Morita, Sony founder

As you can see, Fora Soft takes the hiring process very seriously. Let’s take a look at the statistics that the HR department has provided.

JavaScript statistics (relevant for 1,5 months):

  1. 500 candidates
  2. 20% pass the phone interview – 100 people
  3. 40% pass the face-to-face interview – 40 people
  4. 30% complete the test assignment and the technical interview with the team lead – 12 people
  5. 90% pass the final interview with the CEO – 10 people

iOS:

  1. 50 candidates
  2. 50% complete the phone interview – 25
  3. 20% complete the test assignment and the technical interview with the team lead – 5 people 
  4. 20% complete the final interview with the CEO – 1 person

To sum it up:

We send an offer to 10 JavaScript developers out of 500, which means 2%

We send an offer to 1 iOS developers out of 50, which is again 2%.

We just showed you the numbers. Go with any output you deem fit 🙂

Mentoring process

The mentor is responsible for:

  • Code review. The new programmer creates a separate branch for it, according to Git Flow. Upon completing the task, the new programmer makes a merge request, and the mentor checks the result. The good mentor will pay attention to the logic behind completing the task, leave their feedback, and send it back for the refinement. If the mentor is satisfied with everything, the merge into the development branch happens. Thanks to this mechanism, it’s possible to see how the new programmer progresses. Over time, the number of comments goes down, and the code immediately ends up in Develop
  • Meetings about the developer’s weak and strong suits. After some time, the mentor will form a professional portrait of the new developer, based on the code, approach to tasks, and other teammates’ feedback. As soon as the portrait is ready, the mentor and the developer talk about everything. These meetings happen quite often, approximately once a month
  • Informational basis (Q&A). The new developer can always ask a mentor for help
  • Task distribution. The mentor determines what the new programmer can do now and what is too early for them. The difficulty of assignments grows as the developer grows

How constant development happens

  • Development plan. Every developer creates the plan of the aims. What is it? We take a period of time and create goals for each month. For example, read book X, learn technology Y, watch conference Z. Upon the completion of each milestone, we put a mark. That way it’s easy to see how the developer progresses
  • A gradual increase in task difficulty. The mentor gives tasks to the developer depending on their difficulty. The good mentor will never assign something the new programmer can’t complete. Over time, the difficulty grows
  • Lectures within the company. Anybody can host a lecture. Found an interesting technology? Learn it yourself and let others know!
  • Collective meeting attendances. We keep an eye on the IT world and meet-ups in other companies. We attend those events together, and then create a short review on them

Conclusion

From the statistics, one can understand that we have very high requirements for candidates, and there’s a reason.

The team guarantees the high quality of Fora Soft products. First, we make hiring decisions carefully. Second, this is a culture of constant development. We keep a close eye at new colleagues and their code, help them find themselves in the company, level up their skills.

Thanks to that, our products have such high quality.

You have to love your work. To do that well, you have to enjoy your work. We at Fora Soft are driven by that desire – do things awesomely. We realize what kind of a person is and whether we pursue the same goals during interviews. We never leave a new employee to be eaten by projects. We are always close to guide, give advice, and just look after.

With all that said, our clients are always happy, which makes us happy, too. We created a cool product and made the client happy? The end-user is also happy because of what we did? That’s the goal we pursue.

Categories
Uncategorized

Native or cross-platform application?

Whenever you create an iOS application, the question you ask first is usually “Do I have to develop a native solution using Swift or go with a cross-platform app?”. You will be able to answer this question, as well as understand the advantages and disadvantages of the both options, after reading this article.

video conference app

Short description of cross-platform solutions

React Native

Facebook created and supports this platform. React Native helps develop cross-platform mobile apps on JavaScript. With that being said, developers are free to use up to 70% of the code between different platforms, such as iOS and Android.

Flutter

Flutter is a young yet promising platform. It has attracted attention from big companies that developed their apps. Flutter’s simplicity is comparable with web applications, and its speed is equal to that of native apps’. The programming language that goes with Flutter is Dart, which compiles into a binary code. That allows operation speed to be comparable to Swift’s and Objective-C’s.

Xamarin

Xamarin is a framework for cross-platform mobile application development, which uses the C# language. Microsoft bought Xamarin in 2016, made the source code of Xamarin SDK open, and included it into IDE Microsoft Visual Studio.

Short description of native solutions

Objective-C

Objective-C used to be the main language of development for iOS until not so long ago. This language is a superset of C, that’s why the compiler Objective-C fully understands C-code. Therefore, an app created with Objective-C can be very fast. Objective-C also has object-oriented programming, which helps make programs that you can easily scale in the future if needed. However, this language is quite old. Stepstone created it in the 1980s. Apple, on the other hand, is a very progressive company, so it wasn’t a huge surprise when they introduced a new programming language in 2014 – Swift.

Swift

Swift is a language that allows writing applications for phones, desktop computers, and servers. The compiler is optimized for productivity, and the language is optimized for development, with no compromises from each side. Swift has years of development behind it, and it’s still moving forward, constantly learning new opportunities. When it was just released, the community split into two parts. The first one believed that there was nothing better than Objective-C. The second one rooted for Swift and tried to use it to create apps. Now, after several years of development, it’s safe to say that when it comes to creating iOS apps, Swift is language #1.

Advantages and disadvantages of cross-platform solutions

The idea of writing an app for iOS and Android simultaneously does draw attention to itself. However, nothing is perfect.

Advantages:

  • Simplicity. Choose between JavaScript, C#, or Dart as the main programming language. More developers can work with those, which will simplify the development process
  • Speed and cost of development. You only need one team of developers to create an app that will look the same on both iOS and Android. When you need to create an app for the both platforms quickly, this becomes a substantial advantage.

Disadvantages:

  • Safety. Almost all cross-platform solutions have an open source code, and any thief who knows how to program can look at it, find the weak spots, and hack your app. It’s also important that a cross-platform app connects with the backend via usual HTTP-calls, and thieves can easily intercept your data and use it for their advantage (read more about it here)
  • The difficulty of work with iOS native functions. Swift developers have integrated useful modules into the language. You can work with audio, video, phone camera, location, Bluetooth with those modules. When developing a cross-platform app, the work with these functions is more difficult. For example, to add an AR-object on a video from a camera or demonstrate a screen during an online call, you need to develop additional modules. It increases the time spent on developing, making it more expensive
  • Speed and interface responsiveness. When an app that shows some data (for example, an online shop or a newsfeed) is developed, the speed of a cross-platform app is equal to a native one, but it will often be lower. If your app supports calls, video chats, AR, it works even slower compared to a native app. Users won’t like it if they miss half of what their interlocutor was speaking about, or if they are unable to catch their favorite Pokemon due to low interface responsiveness.

Advantages and disadvantages of native solutions

According to research conducted by Andrew Madsen’s blog, out of 79 most popular non-game applications in the App Store, about 53% are written in Swift, and the other 47% don’t use Swift. It’s important to mention that somebody from that 47% can be using Objective-C, which is a native language, too.

There is also research by statista.com, which says that ⅔ of all apps are native, for both Android and iOS. Why is that?

Advantages:

  • All peculiarities of a platform are considered. No doubt that developing an app for the both platform at the same time is convenient, but each one of them is individual. Requirements for safety, interface design, and payment system integration differ. For example, system elements that iOS and Android have are absolutely different (the example is on the image below). The user expects to see elements familiar to the platform.
ios and android differences
iOS and Android design differences
  • Speed and interface responsiveness. Native-written apps work faster. It’s a lot more convenient for a user to use an app where the animation is smooth, touching buttons is processed instantly, and they can scroll the screen without freezes, while quickly loading content. It is very important as people are actively using apps nowadays to go shopping, visit doctors, attend business meetings. No one would want their screen to freeze in the moment of payment or during important meetings. These things can make the user look for an alternative
  • No obstacles to updating apps or widen their functionality. Platforms evolve, they add new functions, and apps must support them. An iOS update can completely break an application. Unless cross-platform app developers release a new version, the app may not be working, and there is nothing you can do about it
  • Access to own functions and private API platforms. Swift developers have integrated useful modules into the apps, as we’ve mentioned earlier. Whenever you want to create an app with online conferences, AR, or sharing via Bluetooth (push ads upon entering a Bluetooth tracker area or money transfer in bank apps), the developers won’t have to create these modules themselves, which saves time and money
  • Safety. The source code of operating systems and native ways of development is closed. Gaining access to it is impossible, unlike cross-platform apps with an open source code that anyone can access.

Disadvantages:

  • You need to create two applications
  • You then need to support these two applications

Conclusion

Cross-platform and native ways of development have advantages and disadvantages to them. There is no multi-purpose tool that will be better everywhere. When choosing development tools, take an app type, app’s and platform’s peculiarities, your money, and goals into consideration.

For example, go for a cross-platform solution, if you:

  • Are limited in time and money
  • Need your app to look the same on all platforms, despite their peculiarities
  • Don’t need your app to use platform-specific functions like working with the phone camera, difficult animations, photo and video editing, Bluetooth, online calls
  • Don’t require your app to be extremely safe

Examples: news apps, pizza ordering apps, a beauty salon registering apps, online shops.

On the other hand, go for native tools if your app:

  • Is supported during a long period of time
  • Uses a phone camera, difficult animations, Bluetooth, video and audio calls, streams
  • Requires support for new platform functionality after the platform update
  • Has different design on different platforms
  • Looks the way the platform guidelines recommend
  • Is demanding to safety
  • Requires high speed of work and interface responsiveness, however new and powerful a device is.

Examples: e-learning, medicine, internet TV, video chats, video surveillance, augmented reality.

Categories
Uncategorized

In-App Purchase in iOS apps: how to avoid 30% Apple Pay commission

There are many ways to monetize an application. What affects your choice here are the aims and specifics of your application and the market, for which it was made. One of those methods is organizing purchases within the app. From this text, you will find out how iOS organized the process, what Apple and their competitors provide you with, and why you sometimes will have no choice.

How_To_Avoid_AppStore_Comission

In-App Purchases

This simple and easy to use mechanism was developed by Apple to help organize sales of their apps or of additional features from their apps. Apple takes a 30% fee from every purchase made with In-App Purchases

There are three types of In-App Purchases:

  • Consumable

This purchase can be done multiple times. For example, lives or energy in games.

  • Non-consumable

This purchase can only be done once. For example, a character in a game or a movie in an online theater.

  • Subscriptions (auto-renewable and non-renewable)

A payment that unlocks your app’s functions for a limited period of time. Auto-renewable subscriptions charge users automatically at the end of each paid period. To continue using non-renewable subscriptions, users need to renew them manually. iTunes is an example of that.

A few other payment systems

Stripe is an American company that develops solutions for accepting and processing electronic payments. Stripe allows users to integrate payment processing into their apps without a need to register a merchant account.

Stripe takes 2.9% + 30 cents from each successful transaction.

PayPal is the largest debit digital payment platform. PayPal users are able to pay bills, make purchases, accept, and send money transactions.

PayPal takes from 2.9% to 3.9% commission fee, depending on how expensive the product was. The exact fee amount depends on your sales figures and whether you trade domestically or internationally.

Do I need In-App Purchases?

Apple charges lots of money in comparison to their competitors. Going for Stripe or PayPal might look like a no-brainer, but it’s not so simple. When you develop an iOS application, you face multiple requirements from Apple. One of those requirements prohibits you from making purchases through something other than In-App Purchases.

All digital and virtual goods and services must be paid via In-App Purchases. Therefore, owners of entertainment apps and online movie theaters, digital content sellers, and others must use In-App Purchases.

On the other hand, if you’ve created a mobile app for your online store, tour agency, or air ticket office, the outcome of the deal between you and your buyer is a physical item or a physical document that proves your right to use the service. In that case, you can use an external payment system and get your money fast, avoiding being ripped-off by the App Store.

Categories
Uncategorized

WebRTC in iOS: How to Use It in Your iOS [Code Guide]

You’ve probably heard of WebRTC if you wanted to create an online conference app or introduce calls to your application. There’s not much info on that technology, and even those little pieces that exist are developer-oriented. So we aren’t going to dive into the tech part but rather try to understand what WebRTC is.

WebRTC in brief

WebRTC (Web Real Time Communications) is a protocol that allows audio and video transmission in realtime. It works with both UDP and TCP and can switch between them. One of the main advantages of this protocol is that it can connect users via a p2p connection, thus it transmits directly while avoiding servers. However, to use p2p successfully, one must understand p2p’s and WebRTC’s peculiarities.

You can read more in-depth information about WebRTC here

Stun and Turn

Networks are usually designed with private IP addresses. These addresses are used within organizations for systems to be connected locally and they aren’t routed in the Internet. In order to allow a device with a private IP to contact devices and resources outside the local network, a private address must be translated to a publicly accessible address. NAT (Network Address Translation) takes care of the process. You can read more about NAT here. We just need to know that there’s a NAT table in the router and that we need a special record in the NAT which allows packets to our client. To create an entry in the NAT table, the client must send something to a remote client. The problem is that neither of the clients knows their external addresses. To deal with this, STUN and TURN servers were invented. You can connect two clients without TURN and STUN, but it’s only possible if the clients are within the same network.

STUN is directly connected to the Internet server. It receives a packet with an external address of the client that sent it and sends it back. The client learns its external address and the port that’s needed for the router to understand which client has sent the packet. That’s because several clients can simultaneously contact the external network from the internal one. That’s how the entry we need ends up in the NAT table.

TURN is an upgraded STUN server. It can work like STUN, but it also has more functions. For example, you will need TURN when NAT doesn’t allow packets sent by a remote client. It happens because there are different types of NAT, and some of them not only remember an external IP, but also a STUN server port. They don’t allow packets received from servers other than STUN. Not only that, it’s also impossible to establish a p2p connection inside 3G networks. In those cases you also need a TURN server which becomes a relay, making clients think that they’re connected via p2p.

Signal server

We know now why we need STUN and TURN servers, but it’s not the only thing about WebRTC. WebRTC can’t send data about connections, which means that we can’t connect clients using only WebRTC. We need to set up a way to transfer the data about connections (what is the data and why it’s needed, we’ll see below). And for that, we need a signal server. You can use any means for data transfer, you only need the opponents to exchange data among themselves. For instance, Fora Soft usually uses WebSockets.

Video calls one-on-one

Although STUN, TURN, and signal servers have been discussed, it’s still unclear how to create a call. Let’s find out what steps we shall take to organize a video call.

Your iPhone can connect any device via WebRTC. It’s unnecessary for both clients to be related to iPhone, as you can also connect to Android devices or PCs.

We have two clients: a caller and one who’s being called. In order to make a call, a person has to: Receive their local media stream (a stream of video and audio data). Each stream can consist of several media channels. There can be a few media streams: from a camera and a desktop, for example. Media stream synchronizes media tracks, however, media streams can’t be synchronized between each other. Thus, sound and video from the camera will be synchronized with one another but not with the desktop video. Media channels inside the media track are synchronized, too. The code for the local media stream looks like this:

func startLocalStream() {
    let stream = streamsContainer.stream(forIdentifier: PublishStreamModel.publish)
    stream.startCameraCapturer(processDeviceRotations: false, prefferedFrameSize: CGSize(width: 640,height: 480), prefferedFrameRate: 15)
}
  • Create an offer, as in suggesting a call start.
if self.positioningType == .caller {
    self.prepareAndSendOffer()
}
  • Send their own SDP through the signal server. What is SDP? Devices have a multitude of parameters that need to be considered to establish a connection. For example, a set of codecs that work with the device. All these parameters are formed into an SDP object or a session descriptor that is later sent to an opponent via the signal server. It’s important to note that the local SDP is stored as text and can be edited before it’s sent to the signal server. It can be done to forcefully choose a codec. But it’s a rare occasion, and it doesn’t always work.
func stream(_ stream: StreamController?,
            shouldSendSessionDescriptionsessionDescriptionModel: StreamSessionDescriptionModel,
            identifier: String,
            completion: ((Bool)-> ())?) {
    shouldSendSessionDescription?(sessionDescriptionModel, identifier)
}
  • Send their Ice Candidate through a signal server. What’s Ice Candidate? SDP helps establish a logical connection, but the clients can’t find one another physically. Ice Candidate objects possess information about where the client is located in the network. Ice Candidate helps clients find each other and start exchanging media streams. It’s important to notice that that the local SDP is single, while there are many Ice Candidate objects. That happens because the client’s location within the network can be determined by an internal IP-address, TURN server addresses, as well as an external router address, and there can be several of them. Therefore, in order to determine the client’s location within the network, you need a few Ice Candidate objects.
func stream(_ stream: StreamController?,
            shouldSendCandidate candidateModel: StreamCandidateModel,
            identifier: String,
            completion: ((Bool) -> ())?) {
    shouldSendCandidate?(candidateModel, identifier)
}
  • Accept a remote media stream from the opponent and show it. With iOS, OpenGL or Metal can be used as tools for video stream rendering.
func stream(_ stream: StreamController?, shouldShowLocalVideoView videoView: View?, identifier id: String) {
    guard let video = videoView else { return }
    self.localVideo = video
    shouldShowRemoteStream?(video, id)
}

The opponent has to complete the same steps while you’re completing yours, except for the 2nd one. While you’re creating an offer, the opponent is proceeding with the answer, as in answers the call.

if self.positioningType == .callee && self.peerConnection?.localDescription == nil {
    self.prepareAndSendAnswer()
}

Actually, answer and offer are the same thing. The only difference is that while a person expecting a call sets up an answer means, while they generate their local SDP, they rely on the caller’s SDP object. To do that, they refer to the caller’s SDP object. Therefore, the clients know about both device parameters and can choose a more suitable codec.

To summarize: the clients first exchange SDPs (establish a logical connection), then Ice Candidates (establish a physical connection). Therefore, the clients connect successfully, they can see, hear, and talk with each other.

That’s not everything one needs to know when working with WebRTC in iOS. If we leave everything as it is at the moment, the app users will be able to talk. However, only if the application is open, will they be able to learn about an incoming call and answer it. The good thing is, this problem can be easily solved. iOS provides us with a VoIP push. It’s a kind of push notification in iOS, and it was created specifically for work with calls. This is how it’s registered:

// Link to the PushKit framework
import PushKit

// Trigger VoIP registration on launch
func application(application: UIApplication, didFinishLaunchingWithOptions launchOptions: NSDictionary?) -> Bool {
    self.voipRegistration()
    return true
}
    
// Register for VoIP notifications
func voipRegistration() {
    let mainQueue = dispatch_get_main_queue()
    // Create a push registry object
    let voipRegistry: PKPushRegistry = PKPushRegistry(mainQueue)
    // Set the registry's delegate to self
    voipRegistry.delegate = self
    // Set the push type to VoIP
    voipRegistry.desiredPushTypes = [PKPushTypeVoIP]
}

This push notification helps show an incoming call screen which allows the user to accept or decline the call. It’s done via this function:

func reportNewIncomingCall(with UUID: UUID,
                           update: CXCallUpdate,
                           completion: @escaping (Error?) -> Void)

It doesn’t matter what the user is doing at the moment. They can be playing a game or having their phone screen blocked. VoIP push has the highest priority, which means that notifications will always be arriving, and the users will be able to easily call one another. VoIP push notifications have to be integrated along with call integration. It’s very difficult to use calls without VoIP because for a call to happen, the users will have to have their apps open and just sit and wait for the call. That can be classified as strange behavior. The users don’t want to act strange, so they’ll probably choose another application.

Conclusion

We’ve discussed some of the WebRTC peculiarities; found out what’s needed for two clients to connect; learned what steps the clients need to take for a call to happen; what to do besides WebRTC integration to allow iOS users to call one another. We hope that WebRTC isn’t a scary and unknown concept for you anymore, and you understand what you need to apply it to your product. Don’t fear of doing it, as WebRTC is pretty secure!