Netflix App Testing At Scale. Learn how Netflix dealt with the… | by Jose Alcérreca | Android Developers | Apr, 2025
This is part of the Testing at scale series of articles, where we asked industry experts to share their testing strategies. In this article, Ken Yee, Senior Engineer at Netflix, tells us about the challenges of testing a playback app at massive scale and how the testing strategy has evolved since the app was created 14 years ago!

Testing at Netflix continuously evolves. In order to fully understand where it's going and why it's in its current state, it's also important to know the historical context of where it has been.

The Android app was started 14 years ago. It was originally a hybrid application (native + webview), but it was converted to a fully native app because of performance issues and the difficulty of creating a UI that felt and acted truly native. As with most older applications, it's in the process of being converted to Jetpack Compose. The current codebase is roughly 1M lines of Java/Kotlin code spread across 400+ modules and, like most older apps, there's also a monolith module because the original app was one big module. The app is handled by a team of roughly 50 people.

At one point, there was a dedicated mobile SDET (Software Development Engineer in Test) team that handled writing all device tests, following the usual flow of working with developers and product managers to understand the features under test and create test plans for all their automation tests. At Netflix, SDETs were developers with a focus on testing; they wrote automation tests with Espresso or UIAutomator, and they also built frameworks for testing and integrated third-party testing frameworks. Feature developers wrote unit tests and Robolectric tests for their own code. The dedicated SDET team was disbanded a few years ago, and the automation tests are now owned by each of the feature subteams; there are still two supporting SDETs who help the various teams as needed. QA (Quality Assurance) manually tests releases as a final "smoke test" before they're uploaded.

In the media streaming world, one interesting challenge is the huge ecosystem of playback devices running the app. We want to support a good experience on low-memory/slow devices (e.g. Android Go devices) while providing a premium experience on higher-end devices. Some foldables don't report a hinge sensor. We support devices back to Android 7.0 (API 24), but we're raising our minimum to Android 9 soon. Some manufacturer-specific versions of Android also have quirks. Consequently, physical devices are a huge part of our testing.

As mentioned, feature developers now handle all aspects of testing their features. Our testing layers look like this:

Test pyramid showing layers, from bottom to top: unit tests, screenshot tests, E2E automation tests, smoke tests

However, because of our heavy use of physical device testing and the legacy parts of the codebase, our testing pyramid looks more like an hourglass or an inverted pyramid, depending on which part of the code you're in. New features do have this more typical testing pyramid shape.

Our screenshot testing is also done at multiple levels: UI component, UI screen layout, and device integration screen layout. The first two are really unit tests because they don't make any network calls. The last is a replacement for most manual QA testing.

Unit tests are used to test business logic that isn't dependent on any specific device/UI behavior. In older parts of the app, we use RxJava for asynchronous code, and the streams are tested. Newer parts of the app use Kotlin Flows and Composables for state flows, which are much easier to reason about and test compared to RxJava.

Frameworks we use for unit testing (a small Strikt/Turbine sketch follows the list):

  • Strikt: for assertions, because it has a fluent API like AssertJ but is written for Kotlin
  • Turbine: for the missing pieces in testing Kotlin Flows
  • Mockito: for mocking any complex classes not relevant to the current unit of code being tested
  • Hilt: for substituting test dependencies in our Dependency Injection graph
  • Robolectric: for testing business logic that has to interact with Android services/classes in some way (e.g., Parcelables or Services)
  • A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
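
For example, a plain JVM unit test that combines Strikt assertions with Turbine to test a Kotlin Flow might look like this minimal sketch (the holder class is hypothetical, not Netflix code):

```kotlin
import app.cash.turbine.test
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isEqualTo

// Hypothetical class under test: exposes playback speed as a StateFlow.
class PlaybackSpeedHolder {
    val speed = MutableStateFlow(1.0f)
    fun setSpeed(value: Float) { speed.value = value }
}

class PlaybackSpeedHolderTest {
    @Test
    fun `emits updated speed`() = runTest {
        val holder = PlaybackSpeedHolder()
        holder.speed.test {
            // Turbine turns the Flow into awaitable items...
            expectThat(awaitItem()).isEqualTo(1.0f)
            holder.setSpeed(1.5f)
            // ...and Strikt provides the fluent assertions.
            expectThat(awaitItem()).isEqualTo(1.5f)
        }
    }
}
```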

Developers are encouraged to use plain unit tests before switching to Hilt or Robolectric, because execution time goes up roughly 10x with each step from plain unit tests -> Hilt -> Robolectric. Mockito also slows down builds when using inline mocks, so inline mocks are discouraged. Device tests are several orders of magnitude slower than any of these kinds of unit tests. Speed of testing is critical in large codebases.

Because unit tests are blocking in our CI pipeline, minimizing flakiness is extremely important. There are generally two causes of flakiness: leaving state behind for the next test, and testing asynchronous code.

JVM (Java Virtual Machine) unit test classes are created once, and then the test methods in each class are called sequentially; instrumented tests, by comparison, are run from scratch, and the only time you can save is APK installation. Because of this, if a test method leaves some modified global state behind in dependent classes, the next test method can fail. Global state can take many forms, including files on disk, databases on disk, and shared classes. Using dependency injection, or recreating anything that's modified, solves this issue.
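
As a rough illustration (the cache object here is hypothetical), resetting any shared state before each test keeps test methods independent of execution order:

```kotlin
import org.junit.Before
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isNull

// Hypothetical global state: a process-wide singleton that survives
// between test methods because the JVM process is reused.
object TitleCache {
    private val entries = mutableMapOf<String, String>()
    fun put(id: String, title: String) { entries[id] = title }
    fun get(id: String): String? = entries[id]
    fun clear() { entries.clear() }
}

class TitleCacheTest {
    @Before
    fun resetGlobalState() {
        // Recreate/reset anything a previous test may have modified.
        TitleCache.clear()
    }

    @Test
    fun `cache starts empty`() {
        expectThat(TitleCache.get("some-id")).isNull()
    }
}
```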

With asynchronous code, flakiness can always occur as multiple threads change different things. Test Dispatchers (Kotlin coroutines) or Test Schedulers (RxJava) can be used to control time on each thread to make things deterministic when testing a specific race condition. This can make the code less realistic and possibly miss some test scenarios, but it prevents flakiness in the tests.
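
A minimal sketch of the coroutine case (the class with a delayed side effect is made up): inside runTest, delays run on virtual time, so the test advances the clock deterministically instead of sleeping on a real one:

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.test.advanceTimeBy
import kotlinx.coroutines.test.runTest
import org.junit.Test
import strikt.api.expectThat
import strikt.assertions.isFalse
import strikt.assertions.isTrue

// Hypothetical class with an asynchronous, time-dependent side effect.
class DelayedFlag {
    var raised = false
        private set
    suspend fun raiseAfter(millis: Long) {
        delay(millis)
        raised = true
    }
}

class DelayedFlagTest {
    @Test
    fun `flag raises only after virtual time passes`() = runTest {
        val flag = DelayedFlag()
        launch { flag.raiseAfter(1_000) }

        advanceTimeBy(999)                 // virtual time; no real waiting
        expectThat(flag.raised).isFalse()

        advanceTimeBy(2)                   // cross the 1,000ms mark
        expectThat(flag.raised).isTrue()
    }
}
```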

Screenshot testing frameworks are important because they test what's visible rather than behavior. As a result, they're the best replacement for manual QA testing of any screens that are static (animations are still difficult to test with most screenshot testing frameworks unless the framework can control time).

We use a variety of frameworks for screenshot testing:

  • Paparazzi: for Compose UI components and screen layouts; network calls can't be made to download images, so you have to use static image resources or an image loader that draws a pattern for the requested images (we do both; see the sketch after this list)
  • Localization screenshot testing: captures screenshots of screens in the running app in all locales for our UX teams to verify manually
  • Device screenshot testing: device testing used to test the visual behavior of the running app

  • Espresso accessibility testing: this is also a form of screenshot testing, where the sizes/colors of various elements are checked for accessibility; it has also been somewhat of a pain point for us because our UX team has adopted the WCAG 44dp standard for minimum touch size instead of Android's 48dp.
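
To give a flavor of the component-level tests, here's a minimal Paparazzi sketch; the composable is a stand-in, not an actual Netflix component:

```kotlin
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

// Hypothetical component under test; a real one would use static image
// resources (or a pattern-drawing image loader) instead of network images.
@Composable
fun TitleCard(title: String) {
    Text(text = title)
}

class TitleCardScreenshotTest {
    @get:Rule
    val paparazzi = Paparazzi(
        deviceConfig = DeviceConfig.PIXEL_5, // renders on the JVM, no device needed
    )

    @Test
    fun titleCard() {
        paparazzi.snapshot {
            TitleCard(title = "Stranger Things")
        }
    }
}
```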

Finally, we have device tests. As mentioned, these are orders of magnitude slower than tests that can run on the JVM. They're a replacement for manual QA, used to smoke test the overall functionality of the app.

However, since running a fully working app in a test has external dependencies (backend, network infra, lab infra), device tests will always be flaky in some way. This can't be emphasized enough: despite having retries, device automation tests will always be flaky over an extended period of time. Further below, we'll cover what we do to handle some of this flakiness.

We use these frameworks for device testing:

  • Espresso: the majority of device tests use Espresso, Android's main instrumentation testing framework for user interfaces
  • PageObject test framework: internal screens are written as PageObjects that tests can control, to ease the migration from XML layouts to Compose (see below for more details)
  • UIAutomator: a small "smoke test" set of tests uses UIAutomator to test the fully obfuscated binary that will be uploaded to the app store (a.k.a. Release Candidate tests)
  • Performance testing framework: measures load times of various screens to check for any regressions
  • Network capture/playback framework: allows playback of recorded API calls to reduce the instability of device tests
  • Backend mocking framework: tests can ask the backend to return specific results; for example, our home page content is entirely driven by recommendation algorithms, so a test can't deterministically look for specific titles unless it asks the backend to return specific videos in specific states (e.g. "leaving soon") and specific rows filled with specific titles (e.g. a Coming Soon row with specific videos)
  • A/B test/feature flag framework: allows overriding an automation test for a specific A/B test or enabling/disabling a specific feature
  • Analytics testing framework: used to verify the sequence of analytics events produced by a set of screen actions; analytics are the most prone to breakage when screens are changed, so this is an important thing to test.

The PageObject design pattern started as a web pattern, but it has been applied to mobile testing. It separates test code (e.g. click on the Play button) from screen-specific code (e.g. the mechanics of clicking a button using Espresso). Because of this, it lets you abstract the test from the implementation (think interfaces vs. implementations when writing code). You can simply change the implementation as needed when migrating from XML layouts to Jetpack Compose layouts, but the test itself (e.g. testing login) stays the same.
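
A minimal sketch of the pattern with Espresso (the screen, view IDs, and flow are hypothetical):

```kotlin
import androidx.test.espresso.Espresso.onView
import androidx.test.espresso.action.ViewActions.click
import androidx.test.espresso.action.ViewActions.typeText
import androidx.test.espresso.matcher.ViewMatchers.withId
import org.junit.Test

// Screen-specific mechanics live in the PageObject; if this screen moves
// from XML layouts to Compose, only this class changes.
class LoginPage {
    fun enterEmail(email: String) = apply {
        onView(withId(R.id.email)).perform(typeText(email))
    }
    fun tapSignIn() = apply {
        onView(withId(R.id.sign_in)).perform(click())
    }
}

// The test only expresses intent, so it stays the same across migrations.
class LoginTest {
    @Test
    fun signInFlow() {
        LoginPage()
            .enterEmail("user@example.com")
            .tapSignIn()
        // assertions against the next screen's PageObject would go here
    }
}
```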

In addition to using PageObjects to define an abstraction over screens, we have a concept of "test steps". A test consists of test steps. At the end of each step, our device lab infrastructure automatically creates a screenshot. This gives developers a storyboard of screenshots that shows the progress of the test. When a test step fails, it's also clearly indicated (e.g., "couldn't click on Play button") because a test step has a "summary" and an "error description" field.
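
Conceptually (names assumed, not Netflix's actual framework), a test step carries its summary and error description, and the runner can screenshot after each step:

```kotlin
// Hypothetical sketch of the "test step" concept.
data class TestStep(
    val summary: String,           // shown under the storyboard screenshot
    val errorDescription: String,  // reported if the step fails
    val action: () -> Unit,
)

fun runSteps(steps: List<TestStep>, screenshot: (String) -> Unit) {
    for (step in steps) {
        runCatching { step.action() }
            // Surface the failing step clearly, e.g. "couldn't click on Play button".
            .onFailure { error("${step.errorDescription}: ${it.message}") }
        screenshot(step.summary)   // one storyboard entry per completed step
    }
}
```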

Inside a device lab cage

Netflix was probably one of the first companies to have a dedicated device testing lab; this was before third-party services like Firebase Test Lab were available. Our lab infrastructure has a lot of the features you'd expect:

  • Target specific types of devices
  • Capture video while running a test
  • Capture screenshots while running a test
  • Capture all logs

Interesting device tooling features that are uniquely Netflix:

  • Cellular tower so we can test wifi vs. cellular connections; Netflix has its own physical cellular tower in the lab that the devices are configured to connect to.
  • Network conditioning so slow networks can be simulated
  • Automated disabling of system updates on devices so they can be locked to a specific OS level
  • Only raw adb commands are used to install/run tests, since all this infrastructure predates frameworks like Gradle Managed Devices or Flank (see the sketch after this list)
  • Running a suite of automated tests against an A/B test
  • Test hardware/software for verifying that a device doesn't drop frames, so our partners can verify that their devices support Netflix playback properly; we also have a qualification program for devices to make sure they support HDR and other codecs properly.
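
As a rough sketch of what driving a device with raw adb commands looks like from the JVM (the serial number, APK, and test runner below are made up, not Netflix's tooling):

```kotlin
import java.util.concurrent.TimeUnit

// Run a single adb command against a specific device and capture its output.
fun adb(serial: String, vararg args: String): String {
    val process = ProcessBuilder("adb", "-s", serial, *args)
        .redirectErrorStream(true)
        .start()
    val output = process.inputStream.bufferedReader().readText()
    process.waitFor(10, TimeUnit.MINUTES)
    return output
}

fun main() {
    val serial = "emulator-5554"
    adb(serial, "install", "-r", "app-debug.apk")      // install the test build
    val result = adb(
        serial, "shell", "am", "instrument", "-w",     // run instrumented tests
        "com.example.app.test/androidx.test.runner.AndroidJUnitRunner",
    )
    println(result)
}
```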

If you're curious about more details, check out Netflix's tech blog.

As mentioned above, test flakiness is one of the hardest problems with inherently unstable device tests. Tooling needs to be built to:

  • Minimize flakiness
  • Identify causes of flakes
  • Notify teams that own the flaky tests

Tooling that we've built to manage flakiness:

  • Automatically identifies the PR (Pull Request) batch in which a test started to fail and notifies the PR authors that they caused a test failure
  • Tests can be marked stable/unstable/disabled instead of using @Ignore annotations; this is used to temporarily disable a subset of tests if there's a backend issue, so that false positives aren't reported on PRs (see the sketch after this list)
  • Automation that figures out whether a test can be promoted to stable, using spare device cycles to automatically evaluate test stability
  • Automated IfTTT (If This Then That) rules for retrying tests, ignoring temporary failures, or repairing a device
  • Failure reports let us easily filter failures by device maker, OS, or the cage the device is in, e.g. how often a test fails over a period of time for these environmental factors:
    Test failures over time, grouped by environmental factors like staging/prod backend, OS version, phone/tablet
  • Failure reports let us triage error history to identify the most common failure causes for a test, along with screenshots
  • Tests can be manually set up to run multiple times across devices, OS versions, or device types (phone/tablet) to reproduce flaky tests
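
As an illustration of marking tests unstable without @Ignore, a JUnit rule can skip annotated tests so they don't report false positives on PRs; this is a hypothetical sketch, not Netflix's implementation:

```kotlin
import org.junit.Assume.assumeTrue
import org.junit.rules.TestRule
import org.junit.runner.Description
import org.junit.runners.model.Statement

// Hypothetical marker; in practice the stable/unstable state could come
// from a backend service instead of a source annotation.
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.FUNCTION)
annotation class Unstable(val reason: String)

class StabilityRule : TestRule {
    override fun apply(base: Statement, description: Description): Statement =
        object : Statement() {
            override fun evaluate() {
                val unstable = description.getAnnotation(Unstable::class.java)
                // Skipped (not failed), so a backend issue doesn't block a PR.
                assumeTrue("Unstable: ${unstable?.reason}", unstable == null)
                base.evaluate()
            }
        }
}
```

A test class would apply it with `@get:Rule val stability = StabilityRule()` and mark a flaky test `@Unstable("backend issue")`.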

We have a typical PR (Pull Request) CI pipeline that runs unit tests (including Paparazzi and Robolectric tests), lint, ktlint, and Detekt. Running roughly 1,000 device tests is part of the PR process. On a PR, a subset of smoke tests is also run against the fully obfuscated app that would be shipped to the app store (the device tests mentioned earlier run against a partially obfuscated app).

Additional device automation tests run as part of our post-merge suite. Whenever a batch of PRs is merged, extra coverage is provided by automation tests that can't run on PRs, because we try to keep the PR device automation suite under 30 minutes.

In addition, there are daily and weekly suites. These run much longer automation tests, because we try to keep our post-merge suite under 120 minutes. Automation tests that go into these are usually long-running stress tests (e.g., can you watch a whole season of a series without the app running out of memory and crashing?).

In a perfect world, you'd have infinite resources to do all your testing. If you had infinite devices, you could run all your device tests in parallel. If you had infinite servers, you could run all your unit tests in parallel. If you had both, you could run everything on every PR. But in the real world, you need a balanced approach that runs "enough" tests on PRs, post-merge, etc. to prevent issues from getting out into the field, so your customers have a better experience, while also keeping your teams productive.

Device coverage is a set of tradeoffs. On PRs, you want to maximize coverage but minimize time. On post-merge/daily/weekly runs, time is less critical.

When testing on devices, we have a two-dimensional matrix of OS version vs. device type (phone/tablet). Layout issues are fairly common, so we always run tests on phone + tablet. We're still adding automation for foldables, but they have their own challenges, like testing layouts before/after/during the folding process.

On PRs, we typically run what we call a "narrow grid", which means a test can run on any OS version. On post-merge/daily/weekly, we run what we call a "full grid", which means a test runs on every OS version. The tradeoff is that an OS-specific failure can look like a flaky test on a PR and won't be detected until later.

Testing continuously evolves as you learn what works and as new technologies and frameworks become available. We're currently evaluating the use of emulators to speed up our PRs. We're also evaluating Roborazzi to reduce device-based screenshot testing; Roborazzi allows testing of interactions while Paparazzi doesn't. We're building up a modular "demo app" system that allows feature-level testing instead of app-level testing. Improving app testing never ends…
