We are excited to announce a new Gcore development: our JIT (Just-In-Time) Packager. This solution facilitates simultaneous streaming across six protocols: HLS, DASH, L-HLS, Chunked CMAF/DASH, Apple Low Latency HLS, and HESP. In this article, we’ll explain why HLS and DASH streaming make low latency a challenge, dive into alternative technologies with exciting latency reduction potential, and then tell you about our JIT Packager—why we developed it, how it works, its capabilities, benefits, and results.
The difficulty in achieving low latency with standard HLS and DASH technologies stems from their recommended segment length and buffer size guidelines, which can result in latency of twenty seconds or more. Let’s explore why this is the case.
In conventional internet streaming, technologies such as HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) are commonly employed. These HTTP-based protocols divide video and audio content into small segments spanning a few seconds. This segmentation facilitates fast navigation, bitrate switching, and caching. The client receives a text document listing the sequence of segments, their addresses, and additional metadata such as resolution, codecs, bitrate, duration, and language.
Here’s an example: Suppose we initiate transcoding and create segments with a duration of 6 seconds, then start playing the stream. The player first needs to fill its buffer by loading three segments. At the moment we join, the three most recent fully formed segments are segments 3, 4, and 5, so playback begins from segment 3. The delay is therefore at least three segment durations: 3 × 6 = 18 seconds.
Using shorter segments, such as 1–2 seconds, we can reduce the delay: segments with a duration of 2 seconds would result in a minimum delay of 6 seconds. However, this would require reducing the GOP (group of pictures) size, since each segment must begin with a keyframe. Reducing GOP size lowers encoding efficiency and increases traffic overhead, because each segment carries not only video and audio but also container and metadata overhead. Additionally, each segment request incurs HTTP protocol overhead.
This means that shorter segments lead to a larger number of segments and, consequently, higher overhead. With a large number of viewers constantly requesting segments, this would result in significant traffic consumption.
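The arithmetic above can be sketched in a few lines (the three-segment buffer is the conventional HLS player default described earlier):

```python
def min_startup_latency(segment_duration_s: float, buffered_segments: int = 3) -> float:
    """Minimum live latency: the player must buffer `buffered_segments`
    fully formed segments before playback can begin."""
    return segment_duration_s * buffered_segments

# 6-second segments: playback starts at least 18 s behind live.
print(min_startup_latency(6))  # 18
# 2-second segments cut that to 6 s, at the cost of more requests and overhead.
print(min_startup_latency(2))  # 6
```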
To achieve lower latency in streaming, several specialized solutions can be utilized:
- L-HLS
- Chunked CMAF/DASH
- Apple Low Latency HLS
- HESP
These solutions differ from traditional HLS and DASH protocols in that they are specifically tailored for low-latency streaming.
Now, let’s dive into these protocols in more detail.
In the case of L-HLS, the client receives new fragments of the latest segment as they become available. This is achieved by declaring the segment’s address in the playlist in advance using a special PREFETCH tag. This significantly reduces latency, and the data path is shortened as follows:
- The server declares the address of the new live segment in the playlist in advance.
- The player requests this segment and starts receiving its first chunk as soon as it becomes available on the server.
- Without waiting for the next set of data, the player proceeds to play the received chunk.
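As a rough illustration of the steps above, a server-side sketch of such a playlist might look like this (the `#EXT-X-PREFETCH` tag follows the community L-HLS draft; segment names and durations are invented):

```python
def build_lhls_playlist(completed_segments: list[str], next_segment_url: str,
                        target_duration: int = 6) -> str:
    """Build a media playlist that advertises the still-forming live
    segment via a PREFETCH tag, so players can request it early and
    receive its chunks as they become available on the server."""
    lines = ["#EXTM3U", f"#EXT-X-TARGETDURATION:{target_duration}"]
    for url in completed_segments:
        lines.append(f"#EXTINF:{target_duration}.0,")
        lines.append(url)
    # Declare the in-progress segment in advance; the server answers this
    # request with chunked transfer encoding as data arrives.
    lines.append(f"#EXT-X-PREFETCH:{next_segment_url}")
    return "\n".join(lines)

print(build_lhls_playlist(["seg3.ts", "seg4.ts", "seg5.ts"], "seg6.ts"))
```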
When it comes to chunked CMAF/DASH, the standard includes fields that control the timeline, update frequency, delay, and distance to the edge of the playlist. The key enhancements in version 2.6.8 of the dash.js reference player are support for chunked transfer encoding and the Fetch API wherever possible, along with delivering data to the player as soon as it becomes available.
A low-latency stream is indicated through the Latency target and availabilityTimeOffset attributes, which signal the target delay and allow fragment loading to begin before the full segment has formed.
By utilizing these technologies, it is possible to achieve delays in the range of 2-6 seconds, depending on the configuration and settings of both the server-side and player-side components. Furthermore, there is backward compatibility, allowing devices that do not understand low-latency formats to playback full segments as before.
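To make the role of availabilityTimeOffset concrete, here is a simplified sketch of how a player could compute when a live segment becomes requestable (the time model is reduced to the essentials; real DASH availability math also involves the MPD's availabilityStartTime and period offsets):

```python
def segment_available_at(stream_start: float, segment_index: int,
                         segment_duration: float,
                         availability_time_offset: float = 0.0) -> float:
    """A segment is normally requestable once it is fully encoded;
    availabilityTimeOffset shifts that moment earlier, letting the
    player start fetching while the segment is still being formed."""
    fully_formed_at = stream_start + (segment_index + 1) * segment_duration
    return fully_formed_at - availability_time_offset

# With 6 s segments and an offset of 5.5 s, segment 0 becomes
# requestable only 0.5 s after it starts being produced.
print(segment_available_at(0.0, 0, 6.0, 5.5))  # 0.5
```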
Apple LL-HLS offers several latency optimization solutions, including:
- Generating partial segments as short as 200 ms, marked by X-PART in the playlist and available before the full segment forms. Outdated partial segments are regularly removed and replaced by full ones.
- Holding playlist responses until an update actually occurs, rather than replying immediately to each request, so the server delivers fresh playlists the moment they change.
- Transmitting only playlist differences to reduce data transfer volume.
- Announcing soon-to-be-available partial segments with the new PRELOAD-HINT tag, enabling clients to request early and servers to respond once data is available.
- Facilitating faster video quality switching with the RENDITION-REPORT tag, which records information about the last segments and fragments of adjacent playlists.
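Putting these tags together, a simplified LL-HLS media playlist fragment could be generated as follows (part URIs and durations are invented for illustration; a real playlist carries many more tags):

```python
def build_llhls_playlist(parts: list[str], part_duration: float = 0.2,
                         next_part_url: str = "") -> str:
    """Advertise 200 ms partial segments (EXT-X-PART) and hint the next
    one (EXT-X-PRELOAD-HINT) so clients can open the request before the
    data exists and the server responds once it is available."""
    lines = [
        "#EXTM3U",
        f"#EXT-X-PART-INF:PART-TARGET={part_duration}",
        # CAN-BLOCK-RELOAD lets the server hold playlist requests
        # until an update is available.
        "#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES",
    ]
    for uri in parts:
        lines.append(f'#EXT-X-PART:DURATION={part_duration},URI="{uri}"')
    if next_part_url:
        lines.append(f'#EXT-X-PRELOAD-HINT:TYPE=PART,URI="{next_part_url}"')
    return "\n".join(lines)

print(build_llhls_playlist(["seg5.part0.mp4", "seg5.part1.mp4"],
                           next_part_url="seg5.part2.mp4"))
```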
Only Apple LL-HLS works natively on Apple devices, making its implementation necessary for low-latency streaming on these devices.
HESP (High Efficiency Stream Protocol) is an adaptive video streaming protocol based on HTTP and designed for ultra-low-latency streaming. It is capable of delivering video with a delay of under 2 seconds. Unlike the previous solutions, HESP requires 10–20% less bandwidth for streaming by allowing the use of longer GOP (group of pictures) durations.
Using chunked transfer encoding, the player first receives a JSON manifest containing stream information and timing. The streaming process occurs in two streams: the Initialization Stream and the Continuation Stream.
The Initialization Stream contains only I-frames (keyframes), so the player can request an image at any point in time and begin playback immediately. Once playback starts from an Initialization Stream image, the player switches to the Continuation Stream.
This enables fast and uninterrupted video transmission and playback in the user’s player, as well as seamless quality switching. The illustration demonstrates an example where one video quality is initially played and then switched to another, with the Initialization Stream requested once.
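The two-stream startup logic can be sketched as a toy model (this illustrates the idea only, not the actual HESP wire format):

```python
def hesp_startup(init_stream_keyframes: dict[float, bytes],
                 join_time: float) -> tuple[float, bytes]:
    """Pick the latest I-frame at or before `join_time` from the
    Initialization Stream; playback then continues from the
    Continuation Stream at that timestamp."""
    candidates = [t for t in init_stream_keyframes if t <= join_time]
    start = max(candidates)
    return start, init_stream_keyframes[start]

# A viewer joining at t=7.3 starts instantly from the 7.0 s keyframe,
# then continues with the Continuation Stream from there.
frames = {6.0: b"I6", 7.0: b"I7", 8.0: b"I8"}
print(hesp_startup(frames, 7.3)[0])  # 7.0
```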
To implement all these protocols, we decided to create our own solution. There are several reasons behind this decision:
- Independence from vendors: Relying on the quality of a third-party solution comes with challenges. For example, if any issues were to arise, we would be unable to address them until the vendor resolves them—assuming they are even willing to make the necessary changes and/or improvements.
- Gcore’s infrastructure: We have our own global infrastructure, which spans from processing servers to content delivery network. Our development team has the expertise and resources needed to implement our own solution.
- Common features among the integrated technologies: The shared characteristics of the technologies we evaluated allow for seamless integration within a unified system.
- Customizable metrics and monitoring: With our own solution, we can set up metrics and monitoring according to our preferences and with our own customization options.
- Adaptability to our and our clients’ needs: Having our own solution enables us to quickly adapt and customize it to specific tasks and client requirements.
- Future development opportunities: Developing our own solution empowers us to evolve in any direction. As new protocols and technologies emerge, we can seamlessly add them to our existing stack.
- Backward compatibility with existing solutions: Backward compatibility with existing solutions is essential. We can carefully assess how any new innovations may impact clients who previously relied on our prior solution.
When considering the specific technologies, not all third-party solutions support Apple LL-HLS and HESP. For instance, the Apple Media Stream Segmenter is limited to MPEG-2 TS over UDP, runs only on macOS, and writes files to the local file system. The HESP packager + HTTP origin, on the other hand, transmits files via Redis and is written in TypeScript.
It’s important to note that relying on these external solutions consumes resources, introduces delays and dependencies, and can impact parallelism and scalability. Moreover, managing a diverse array of solutions can complicate maintenance and support.
The operation of our JIT Packager can be outlined as follows:
- The transcoder sends streams to our Packager.
- The Packager generates all the necessary segments and playlists on the fly.
- Clients request streams from the EDGE node of the CDN.
- Thanks to the API, the CDN already knows which server to fetch the content from and retrieves it.
- The response is cached in the chunked-proxy’s memory.
- For all other clients, it is served directly from the cache.
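The request flow above can be summarized in a small sketch (function and parameter names are illustrative, not our actual API):

```python
def serve_stream(url: str, cache: dict, resolve_origin, fetch) -> bytes:
    """Edge-node flow: serve from the in-memory chunked-proxy cache when
    possible; otherwise ask the API which packager holds the stream,
    fetch it, and cache the response for subsequent viewers."""
    if url in cache:                 # later viewers are served from cache
        return cache[url]
    origin = resolve_origin(url)     # API maps stream -> packager server
    data = fetch(origin, url)        # first viewer pulls from the origin
    cache[url] = data                # response cached in memory
    return data
```

In the real system the cached object is a stream being filled chunk by chunk rather than a completed byte string; that refinement is covered in the partial-response caching section below.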
On average, we achieved a cache hit rate of approximately 80%.
Let’s take a look at what we have accomplished with our JIT Packager.
We have successfully developed a unique JIT Packager capable of simultaneously streaming video in HLS, DASH, and all currently available low-latency streaming formats. Primarily, it accepts video and audio streams in fragmented MP4 format from the transcoder. The server directly extracts all necessary media data from the MP4 files and dynamically generates initialization segments, corresponding playlists, and video fragments for streaming in all mentioned streaming modes with minimal delays. Subsequently, the streams become available for distribution via a CDN.
Our solution operates within an internal network using HTTP/1.1 without TLS. TLS offers no benefit in this context and would only introduce unnecessary overhead by encrypting the entire stream a second time. Data is instead transmitted using chunked transfer encoding.
As a result, we have not only developed a Packager but also an HTTP server capable of delivering video in all the previously mentioned formats. Moreover, the same video and audio streams are utilized for each format, ensuring efficient resource utilization.
We have implemented DVR functionality to allow users who have missed a live broadcast to rewind and catch up. All microsegments are stored in a separate cache in the server’s memory. Subsequently, they are merged and cached on disk as complete video fragments, and these complete segments are served during rewind playback. DVR segments are automatically deleted after a certain period of time has elapsed.
When it comes to protocols utilizing chunked transfer encoding, it is important to note that not all CDNs support caching files before they are fully downloaded from the origin server. While nginx, acting as a proxy server, can handle upstream responses with chunked transfer encoding and proxy them chunk by chunk, subsequent requests for the same resource bypass the cache and go directly to the origin until the full response has completed; the cache is only used once the complete response is available. This approach proves ineffective for scaling low-latency video streaming, where a significant number of viewers are likely to request the last segment simultaneously.
To address this challenge, we have implemented a separate caching service for chunked-proxy requests on each CDN node. Its key feature lies in the ability to cache partial HTTP responses. This means that while the first client initiating the request to the source receives its response, any number of clients desiring the same response will be served by our server with minimal overall delay. The already-received portions will be immediately delivered, while the rest will be provided as they arrive from the source. This caching service stores the passing requests in the server’s memory, allowing us to reduce latency compared to storing fragments on disk.
Memory usage limits are also taken into account. If the total cache size reaches the limit, elements are evicted based on the least recently accessed order. Furthermore, we have developed a specialized API that enables CDN edge nodes to proactively determine the content’s location in advance.
The development of our JIT Packager has allowed us to achieve our goals in low-latency streaming. We can stream through multiple advanced protocols simultaneously without relying on third-party vendors, significantly improving the user experience. We can promptly respond to incidents and adapt the solution to meet client needs more efficiently.
But we’re not stopping there. Our plans include further reducing latency while maintaining quality and playback stability. We are also working on optimizing the system as a whole, adding more metrics for monitoring and control, and continuing to push the boundaries of innovation in the field.
We are excited about the possibilities ahead and remain dedicated to delivering our users high-quality, low-latency streaming experiences.