Closed Bug 702813 Opened 13 years ago Closed 12 years ago

Pages with many DOM objects which don't live in the document can cause multi-second CC pauses

Categories

(Core :: DOM: Core & HTML, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
firefox9 + ---
firefox10 + ---

People

(Reporter: justin.lebar+bug, Assigned: smaug)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Snappy])

Attachments

(8 files, 5 obsolete files)

This is likely the root cause of bug 700645 and bug 701443.

We need a simple, synthetic testcase to prove that this is the cause.  Then we presumably need to figure out how to modify the CC to avoid failing so hard here.
Now that I think about it, I should be able to analyze the CC logs from you and from dvander to see what the deal is with the DOM nodes in there.

The basic problem is that for "normal" DOM nodes, the CC will see that the node is part of a window that is currently being displayed, and then not traverse the entire tree.  In at least jlebar's IRCcloud case, the page is generating huge numbers of nodes that are not in the currently displayed window, but are somehow still live.  The question is, can we somehow test that these are live?

The first step is to figure out why these are alive, and then see if we can link the DOM nodes to what is holding them alive.
One thing I need to look at is whether these nodes are truly free-floating, or whether they are part of a separate DOM tree.  I have a crazy idea for handling the latter case, by treating the entire DOM tree as a single node in the CC graph, and write-barriering anything (like setting user data) that creates edges out of the graph.
In the synthetic benchmark, the failure mode involves a disconnected subtree with a million nodes.  The relevant code is:

function touch3()
{
  // doc and last are defined elsewhere in the testcase; last is a node in
  // the large disconnected subtree.
  var thing3 = doc.createElement("p");
  last.appendChild(thing3);
}

The key is that after the function returns, thing3 can be GC'd.  thing3 is an XPConnect wrapper (a slim wrapper here, though a regular wrapper works too), and when it's GC'd it Releases the native object, which causes the native to get suspected (or, with a regular XPConnect wrapper, some XPConnect object which holds the native gets suspected).

Then when the CC runs, the suspected native for thing3 ends up traversing the entire disconnected subtree (the million nodes).  Since the nodes aren't in the document, the INTERRUPTED_TRAVERSE stuff doesn't save us.  If you load this in a clean profile you'll see this quite clearly.  After you click your way through a bunch of slow script dialogs, your CC time will be extremely low (around 10ms on my machine).  If you click on the 3rd button, the next CC will be quite long (around 1s on my machine).
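For reference, a self-contained sketch of this failure mode (this is not the attached benchmark; the node count, tree shape, and names are illustrative):

// Build a disconnected subtree with a million nodes; the JS variable
// bigRoot keeps the whole thing alive, but nothing is ever inserted into
// the document.
var bigRoot = document.createElement("div");
for (var i = 0; i < 1000000; i++) {
  bigRoot.appendChild(document.createElement("p"));
}
var last = bigRoot.lastChild;

function touch() {
  // The new node's wrapper becomes garbage once this function returns; when
  // the GC collects it, the native node is Released and suspected, and the
  // next CC traverses the whole disconnected subtree from there.
  last.appendChild(document.createElement("p"));
}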

I think in some sense this is a regression from the strong parent pointer stuff.  The architectural issue was there before, but now that parent pointers are strong we end up traversing the entire disconnected subtree as opposed to some sub-subtree.
One thing we could maybe do is not traverse a node if its JS object was marked black by the GC.
That doesn't help us very much, though (at least in the worst case).  If the one native whose JS object is black is on the other side of the subtree we still end up traversing almost all of it.
If you load the synthetic benchmark in Firefox 8 (the last version without strong parent pointers) you'll find that the CC pause is negligible (as expected) and that the testcase loads much, much, much faster.  I'm not sure whether the latter is connected to the former or not.
Nominating for tracking on the basis that we have a synthetic benchmark that shows massive performance regressions.
Apparently the loading pause only shows up in my local build (not even on nightlies ...) but the CC slowness is definitely there.
Attached file Synthetic benchmark
The previous benchmark with one addition, a case that causes the high CC times in weak-parent-pointer builds.
Attachment #574861 - Attachment is obsolete: true
So strong parentNode did add a new way to cause huge CC graphs, but there were other ways to create
such graphs even with weak parentNode.
Sure, but it looks like this is being hit by web content now, while it wasn't before.
If we decide to disable strong parentNode in 9 (and 10), that decision must happen *very* soon,
during this week or so.
Backing out should be doable, but it will touch lots of code and would take some time
to get all the backout patches ready.

And, if we back out, I think we should back out from trunk too until cycle collector can handle
the situation better.

But, I'm still hoping we could find out some, perhaps hacky, fix for cycle collector handling.
Before we panic too much, we should try to figure out how much this actually hits web content.  We've seen a few weird cases that maybe cause problems, but only Asa has been able to consistently recreate a problem.  (I haven't confirmed that dvander's case is actually this.  I'll try to put together a script today that can find these problems.)

We should look at telemetry, and compare what 8 looks like to 9 and 10.  I think the best measure to look at is the number of ref counted nodes in the CC graph.  That will avoid problems due to differing hardware, though not websites.  Is there an appreciable spike?

Secondly, precisely what are we trying to fix?  Cases where the top level node is being held live by JS?  Where any node is being held live by JS?  Where any node is being held live for any reason?

I have some ideas for the first two.
An unoptimized black node hack (which doesn't leak with Kyle's test, and the thing3 test's
CC time drops in a debug build from 750ms to 15ms).
The algorithm is silly (it misses marking nodes, which leads to a lot of iterating),
and that's why it could cause slowness in certain cases.
Pushed to try to see if it leaks :)
https://hg.mozilla.org/try/rev/6c55fa768be8
https://tbpl.mozilla.org/?tree=Try&rev=6c55fa768be8

But anyway, I hope the patch shows that perhaps there are simple ways to fix at least the
most common problems related to not-in-document subtrees.
Yeah, this patch is going to be something like n log n in the worst case.  n being the number of nodes, and log n being the depth of the tree.  Or something like that.

Can we somehow move this check into the cycle collector itself?  We could associate with owner documents if we've ever found a member document that has a nongray wrapper that is keeping it alive.  Then, while we are traversing up the parents, we add the parents to the graph.  This will keep them from being traversed twice, at the cost of bloating up the graph a bit before we find the wrapped node.

Are these nodes with marked wrappers going to be traversed as XPConnectRoots?  Those are added to the graph first, so potentially if we store nongrayness in the owner document then we don't have to actually traverse up the parent chain: by the time we get to nodes in the "meat" of the DOM, we'll already have examined if any of the members of the DOM are being held alive in this way.

I should also make sure this will actually help with the IRCcloud graph I got from jlebar.  I looked at the paths that were holding things live, and they followed some weird path through JS back out to a DOM, but the path finding can be a little wonky.
(In reply to Andrew McCreight [:mccr8] from comment #16)
 
> Can we somehow move this check into the cycle collector itself?  We could
> associate with owner documents if we've ever found a member document that
> has a nongray wrapper that is keeping it alive.
I'm not sure what this means. "member document"?

>  Then, while we are
> traversing up the parents, we add the parents to the graph.  This will keep
> them from being traversed twice, at the cost of bloating up the graph a bit
> before we find the wrapped node.
Adding anything to the graph and using ::Traverse is slow compared to the
non-virtual inline calls that can be used when dealing with nodes.
 
> Are these nodes with marked wrappers going to be traversed as
> XPConnectRoots?  Those are added to the graph first, so potentially if we
> store nongrayness in the owner document
But we can't store non-grayness in the owner document. We're dealing with
unconnected subtrees here. ownerDocument doesn't own the nodes in that case.

> then we don't have to actually
> traverse up the parent chain: by the time we get to nodes in the "meat" of
> the DOM, we'll already have examined if any of the members of the DOM are
> being held alive in this way.
This would work if ownerDocument would keep disconnected subtrees alive, but it doesn't.

> I should also make sure this will actually help with the IRCcloud graph I
> got from jlebar.  I looked at the paths that were holding things live, and
> they followed some weird path through JS back out to a DOM, but the path
> finding can be a little wonky.
To fix the regression-like part of this for 9/10, it should be enough to figure out whether the root of the subtree is
black, since in 8 it was necessary to keep the root explicitly alive to keep the subtree alive.

This could be optimized so that if we're dealing with a tree which has a document as its root,
we mark that document as black, and then it is enough to check node->GetCurrentDoc()->IsBlack().

But for trees where some other node type is the root, we would need to mark all the ancestors
and clear the markers after CC. That would be O(2n) (where n = number of nodes in the tree).
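A minimal JS model of the ancestor-marking idea (a toy illustration with made-up node fields, not the Gecko cycle collector code):

// Nodes here are plain objects with assumed fields: parent, wrapperIsBlack,
// and a knownBlack cache.  Checking a node walks up to the root of its
// (possibly disconnected) tree once and caches the answer on every ancestor
// touched, so repeated checks during the same CC are cheap.  A real version
// would clear the knownBlack markers after the CC, which is the second O(n)
// pass behind the O(2n) estimate above.
function subtreeRootIsBlack(node) {
  var chain = [];
  var cur = node;
  while (cur.parent && cur.knownBlack === undefined) {
    chain.push(cur);
    cur = cur.parent;
  }
  var black = cur.knownBlack !== undefined ? cur.knownBlack
                                           : !!cur.wrapperIsBlack;
  cur.knownBlack = black;
  for (var i = 0; i < chain.length; i++) {
    chain[i].knownBlack = black;
  }
  return black;
}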

Btw, the hacky patch passed tests on try without leaks.
Okay, most of my comments were just me being confused about the document...
Blocks: 698919
Attached file Counter Smaug's patch
Here's a test example that is a minor variant of Kyle's that causes slow CC times even with Smaug's patch.  The basic trick is here:

  var nu_root = document.createElement("span");
  root = document.createElement("a");
  nu_root.appendChild(root);

Instead of storing the tippy-topmost node (nu_root) in a JS variable, we store one of its children.  Thus, when we suspect some DOM node deep in the tree, we don't find the node with the marked wrapper.
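A sketch of the full counter-case (the subtree-building loop and the function name are illustrative, not the attached file):

function buildCounterCase() {
  var nu_root = document.createElement("span"); // goes away when this function returns
  root = document.createElement("a");           // global: only this child is reachable from JS
  nu_root.appendChild(root);
  // Fill out the subtree under root; the tree rooted at nu_root stays alive
  // (strong parentNode) but is never inserted into the document.
  for (var i = 0; i < 1000000; i++) {
    root.appendChild(document.createElement("p"));
  }
}

A check that only looks at the topmost ancestor (nu_root) finds no marked wrapper there, so the CC still ends up traversing the whole subtree.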

And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't being held alive by a marked wrapper, so this optimization may not work there either.  The topmost element in the DOM, which appears to be a <div>, has a preserved wrapper, but the wrapper is not marked.  I didn't check every node on the way up.  I also made the assumption that a random element I picked must be part of the huge glob.
I looked at telemetry and didn't see a particularly huge difference in ref counted CC graph size between 8 and 9.  Looking at CC times, they seem about the same too.
(In reply to Andrew McCreight [:mccr8] from comment #19)
> Created attachment 575341 [details]
> Counter Smaug's patch
> 
> Here's a test example that is a minor variant of Kyle's that causes slow CC
> times even with Smaug's patch.  The basic trick is here:
> 
>   var nu_root = document.createElement("span");
>   root = document.createElement("a");
>   nu_root.appendChild(root);
> 
> Instead of storing the tippy-topmostnode (nu_root) in a JS variable, we
> store one of its children.
Sure, but that doesn't even work in 8 - I mean, in 8 we'd delete nu_root.
So I'm not too worried about that case, since I expect web sites to actually store the
root somewhere because of the Gecko bug in <= 8.


> And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't
> being held alive by a marked wrapper, so this optimization may not work
> there either.
>  The topmost element in the DOM, which appears to be a <div>,
> has a preserved wrapper, but the wrapper is not marked.  I didn't check
> every node on the way up.  I also made the assumption that a random element
> I picked must be part of the huge glob.
But something must be keeping the subtree alive. In 8, if you don't explicitly keep the root of some subtree
alive, the root will be deleted, and all its children disconnected from it. This can lead to several
disconnected subtrees.
Depends on: 702609
smaug: I've been running the try build you linked in the other threads, with both the AMCtv website and IRCcloud going.  heap-unclassified has held fairly steady, below 30%.  Maybe it is increasing slowly.  My CC pause times are a little weird.  About 7/8 are reasonable-ish, around 200 to 300ms.  About 1/8 are longer, around 750ms to 1000ms.  The long CC was getting worse and worse, but dropped down to the low end again.  There's no reason I can see that it has gotten better, and there's nothing in particular that I've noticed about why the longer one is longer.  Sometimes they are right after a GC, sometimes there's another CC in between.
I mean the patch in bug 702609.
I'd be interested to know how the patch affects those cases where heap-unclassified is high.
Depends on: 704623
Are we going to take any countermeasures here for 9?  The clock is almost out ...
If the problem is strong parentNode, it is quite strange that the first bugs were filed 3.5 months
after the patch landed on trunk.
Also, some reports say that there is a regression from 9->10
(which would mean some totally different problem).

But in any case I can prepare the strong-parentNode-backout patch for 9.
Depends on: 708572
Whiteboard: [MemShrink:P1] → [MemShrink:P1][Snappy]
I wonder if bc's spider could let us know which top sites seem to trigger these problems.
I know there have been lots of additions of logging capability for memory, GC and CC, but I haven't kept up on the details. Can someone point me in the right direction to find out more about how to use them?
javascript.options.mem.log appears to be at least part of what I need. 

I can get the time in seconds at which a GC or CC occurs since the browser started from GC(T+time) or CC(T+time), and the duration of CCs from the "duration: [0-9]+ ms.$" pattern in the CC output line.

What are the other interesting fields from the output lines or should I just collect them all in a csv for further analysis?

Would you want the results from starting the browser, loading the page, waiting X seconds, then stopping the browser, or the results from a continuous run where the browser starts and then successively loads the top sites with a delay of X seconds between pages while recording the GC/CC output? What should the wait period be to let the browser sit on a page?
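For reference, a rough Node.js sketch of pulling those two fields into a csv (the log-line layout here is assumed from the patterns above, not taken from a real log; the regex will likely need adjusting for a given build's output):

// Usage (hypothetical file name): node parse-cclog.js mem.log > mem.csv
var fs = require("fs");
var lines = fs.readFileSync(process.argv[2], "utf8").split("\n");
console.log("type,start_seconds,duration_ms");
for (var i = 0; i < lines.length; i++) {
  // Match either GC(T+123.4) or CC(T+123.4) plus a trailing "duration: N ms."
  var m = /(GC|CC)\(T\+([\d.]+)\).*duration:\s*(\d+)\s*ms\.?$/.exec(lines[i]);
  if (m) {
    console.log(m[1] + "," + m[2] + "," + m[3]);
  }
}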
I'd like to know about pages which, say, 15s after page load cause >500ms CC times.
(Exact!) steps to reproduce would be great.
Btw, a patch landed on Friday which changes CC handling in some cases.
If you see any regressions, please file new bugs and make them block Bug 712743.
If you see lower CC times, report to Bug 712743.
I captured the GC and CC times on two separate Linux VMs using today's Nightly build for the top 999 sites in two cases:

1) Individual Loading - Start the browser, Load the Page and wait 15 seconds, Shutdown the browser.
2) Continuous Loading - Start the browser, Load each Page in succession waiting 15 seconds after each page, Shutdown the browser.

As you can see, there are a few spikes for individual sites but none over 500 ms (one CC up to 410 ms), and perhaps a secular growth trend for the continuous loading case, but I believe 999 sites is not a sufficient sample. If this presentation is good enough for you, I can produce results for a larger set of pages.

Would it have been better to accumulate the total GC+CC time for each site rather than reporting the results of each individual GC/CC?
The test run where the browser is started, the page loaded, and the browser shut down after 15 seconds has completed. I've attached a bzip2'd csv file of the data since it is smaller. It is trivial to import into a spreadsheet and to create graphs.

The top 10 times were:

Site                       Type   Duration (ms)
http://jsonline.com/       GC     1396.5
http://haaretz.co.il       GC     1258.2
http://crictime.com/       GC      646.6
http://haaretz.com/        GC      303
http://ilemi.me/           GC      289
http://mako.co.il/         CC      288
http://archdaily.com/      CC      233
http://hindustantimes.com/ GC      232.9
http://redmondpie.com/     CC      224
http://deadspin.com/       CC      192
After some hangs and restarts, the test run where each of the top 10,000 sites is loaded during a single browser session is still running and has completed 6900 sites. I've attached the bzip2'd csv file of the data for your analysis. I'll let the run complete, but I consider this data sufficient and won't attach the full set unless someone requests it.

Note the secular increases in the gc times and their subsequent drop offs. The drop offs correspond to my manually closing the number of pop up windows that had been opened by the web sites. The gc times increase as the number of open windows increases and drops back to "normal" levels after the extra windows are closed.
Attachment #585913 - Attachment is obsolete: true
Attached file partial test with try (csv) (obsolete)
Attachment #585914 - Attachment is obsolete: true
           nightly  try  (times in ms)
min     GC     0      0
average GC    21     20
max     GC   580    223
stddev  GC    23     14
min     CC     0      0
average CC     8     17
max     CC   581    328
stddev  CC    15     22
Hey Bob - sending this tracked, yet unassigned, bug over to you since I think you have the next action.
Assignee: nobody → bclary
Ouch! That's hot! (http://en.wikipedia.org/wiki/Hot_Potato_%28game%29) Actually Smaug is doing the work. I'm just helping out with some data collection.
Assignee: bclary → bugs
Attachment #586055 - Attachment is obsolete: true
Attachment #586056 - Attachment is obsolete: true
           nightly  try  (times in ms)
min     GC     0      0
average GC    20     22
max     GC   796   8872 *
stddev  GC    22     64
min     CC     0      0
average CC     7     19
max     CC   581    328
stddev  CC    11     24

[*] not reproducible.
Not really a MemShrink issue, as there's no leak here.  The lack of about:memory coverage for disconnected DOMs is being tracked in bug 704623.
Whiteboard: [MemShrink:P1][Snappy] → [Snappy]
Looks fixed to me, on a recent Nightly.  CC times never go above, say, 50ms with the two example files.  They still cause 5 second CCs when you close the tab, and the CC frees a million nodes, but that's a separate issue.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: DOM → DOM: Core & HTML
Restrict Comments: true