How much flakiness do you tolerate in end to end tests?

kersplort@programming.dev · edit-2 1 year ago

How much flakiness do you tolerate in end to end tests?

souperk@reddthat.com · 1 year ago

As always I would say there is a huge “it depends”.

For context, I am part of a small team of engineers, working on a relatively new product, we have continuous deployment setup for our release branches. We prefer many small PRs, think at least a PR a day per engineer.

I am responsible for setting up a new e2e test suite right now, so it’s possible I reconsider later on. But, there are a couple lessons learned from our previous iteration.

Our pipeline was slow (20-30 mins), flakiness was a no go. Decreasing pipeline time increased tolerance for flakiness.
Flakiness on the pipeline translated to flakiness on the production instances. When we started caring for those our sentry got much more happy.
We didn’t have the time to go back and fix issues, so we stopped having nightlies. If it’s important enough we should block merging on main and fix it.

i_stole_ur_taco@lemmy.ca · 1 year ago

My org has issues with e2e, but we keep them because they usually inform that something, somewhere isn’t quite right.

Our CI/CD pipeline is configured to automatically re-run a build if a test in the e2e suite fails. If it fails a second time, then it sends up the usual alerts and a human has to get involved.

In addition to that, we track transient failures on the main branch and have stats on which ones are the noisiest. Someone is always peeling the noisiest one off the stack to address why it’s failing (usually time zones or browser async issues that are easy to fix).

It’s imperfect, and still results in a lot of developers just spam retrying builds to “get stuff done”, but we’ve decided the signal to noise ratio is good enough that we want to keep things the way they are.

Pantoffel@feddit.de · edit-2 1 year ago

I think/hope that the wording you used was a mistake.

End to end tests do not introduce flakiness, but uncover it.

Whenever we discover flakiness, we try to fix it immediately. When there is no time for the fix (which is more than often the case) we create a ticket that vanishes in the backlog.

For a long time the company I currently work at didn’t have end to end tests save unit tests for a lot of their code.

Through a push of newcomers we finally managed to add end to end tests to many more parts of the code. However, these are still not properly documented. Some end to end tests overlap and some only cover a small part of one larger functionality. That is why we often find bugs that were introduced by us, because we had no end to end tests covering those parts.

We used to run end end tests only every night on the whole product. They usually take an hour or more to complete. This takes too long to run them before each merge. However, we have them organized enough such that for sub-product A we can run the sub-product A end to end tests only before each merge where we assume that we did only touch code affecting sub-product A. In case the code changes affected some other parts of the product, the nightly tests help us out. We are doing this in my team for a long while now. But we just recently started to establish this procedure in the other teams of the company, too.

kersplort@programming.dev · 1 year ago

My experience with E2E testing is that the tools and methods necessary to test a complex app are flaky. Waits, checks for text or selectors and custom form field navigation all need careful balancing to make the test effective. On top of this, there is frequently a sequentiality to E2E tests that causes these failures to multiply in frequency, as you’re at the mercy of not just the worst test, but the product of every test in sequence.

I agree that the tests cause less flakiness in the app itself, but I have found smokes inherently flaky in a way that unit and integration tests are not.

Pantoffel@feddit.de · 1 year ago

Okay I must admit that I do not have much experience with smoke and integration tests. We run end to end tests only and skip running the other two types entirely. They would be covered by the end to end tests anyways.

Perhaps I am lucky in that our software doesn’t require us to use many waits at all. Most things are synchronous and those that are not mostly have API endpoints where the status of the process an be safely queried, i.e. a wait(1000) and hope for the best is not necessary, but rather do wait(1000) until isFinished().

And yes, for us it is also a mess of errors popping up when one step in a pipeline fails, where many tests rely on this single step. I don’t know whether there is a way to approach this issue neatly. This is surely a chance in the market to be taken.

learningduck@programming.dev · 10 months ago

You run E2E test before each merge. So, you don’t merge very often?

How about running an integration test before each merge instead of a full fledged E2E and mocking out external dependencies (other services) during the test, then do E2E testing on a schedule like nightly?

I prefer it this way, because mocking out external dependencies cut out network instability and bugginess from dependencies. So, we can merge faster. Agree that test scenarios are overlapping, and if your E2E is very stable then it is probably not worth it, but unfortunately it’s not so stable in my environment.

Pantoffel@feddit.de · edit-2 10 months ago

Luckily, our e2e tests are pretty stable. And unfortunately we are not given the time to write integration tests as you describe. The good thing would be that with these mocks we were then also be able to load test single services instead of the whole product.

We merge multiple times a day and run only those e2e tests we think are relevant. Of course, this is not optimal and it is not too rare that one of the teams merges a regression, where one team or more talented at that than the others.

You see, we have issues and we realize we have them. Our management just thinks these are not important enough to spend time on writing integration tests. I think money and developer time are two of the reasons, but the lack of feature documentation, the lack of experts for parts of the codebase (some already left for another employer), and the amount of spaghetti code and infrastructure we have are other important reasons.

learningduck@programming.dev · 10 months ago

Reading the 3rd paragraph and I see myself 😄. Glad that you and the team managed to add another layer of testing successfully.

luckystarr@feddit.de · 1 year ago

deleted by creator

hoodlem@hoodlem.me · 1 year ago

Having the e2e smokes be a requirement for PR merge is frustrating. But I’ve been on the fence before with this on my own teams. It’s enticing to have a completely “clean” main branch that has not been infested with regressions caused by a PR.

It also gives you confidence in the crummy cases where you need to push a fix to prod right now.

If the e2e’s flap too much, then it is not an option. I’ve tried it and it lasts one sprint before we nix it. It’s just too frustrating and development comes to a standstill.

We have a retry policy on any smokes that need to run in a step by step order, and we aggressively prune and remove smokes that frequently fail or don’t test for real issues.

I actually think that’s the best way to handle it.

Who fixes the issues that the smokes find?

On teams I’ve been on, typically a junior dev. Sounds crummy, but it actually gives them more experience with the product/code. I have been that junior dev before and I found it a positive experience.

Pantoffel@feddit.de · 1 year ago

I find it interesting that the junior would fix these issues. In my team we established a fictive role which each developer takes in sprint rotation. The developer in this role would then handle these cases and also drive the investigation of incoming support requests / bug reports, playing third level support in a sense.

How do you handle support requests in your team/company? Is it the developers or do you have a dedicated team for that?

kersplort@programming.dev · 1 year ago

My team has just decided to make working smokes a mandatory part of merging a PR. If the smokes don’t work on your branch, it doesn’t merge to main. I’m somewhat conflicted - on one hand, we had frequent breaks in the smokes that developers didn’t fix, including ones that represented real production issues. On the other, smokes can fail for no reason and are time consuming to run.

We use playwright, running on github actions. The default free tier runner has been awful, and we’re moving to larger runners on the platform. We have a retry policy on any smokes that need to run in a step by step order, and we aggressively prune and remove smokes that frequently fail or don’t test for real issues.

AggressivelyPassive@feddit.de · 1 year ago

Shouldn’t you find out, why the smoke tests keep failing? It’s not a good sign, if you can’t even guarantee stability in a controlled environment.

kersplort@programming.dev · 1 year ago

It’s not a fully controlled environment, that is the point of smokes.

AggressivelyPassive@feddit.de · 1 year ago

How do you not fully control the environment your PRs are tested in?

kersplort@programming.dev · 1 year ago

dependencies.