freddy33 Mon 14 Sep 2009
I don't know if it was reported, but anyway here are my latest issues.
I'm now using Snow Leopard with Java 6 64-bit, and decided to build from hg (rev 1952). I had stopped doing hg builds for some time, so I don't think it's Snow Leopard related.
1) Strange test assertion for me in testSys::RepoTest? The line of code is:
verifyNotSame(Repo.readSymbols(uri), Repo.readSymbols(uri))
which looks equal to me :)
2) When reaching ActorTest it always deadlocks?
From there, no CPU, just stuck.
I'm creating the IntelliJ project that represent the Fan source code, and I hope to have a shot at debugging this soon if you don't know where it comes from.
HTH, Fred.
brian Mon 14 Sep 2009
1) I have seen that on OS X before too - I believe it is a file system timestamp thing, where Windows is a little quicker to notice a modified file and re-load. Maybe a longer pause might fix it. Does anybody know if OS X supports sub-second resolution on file timestamps? (A probe sketch follows below.)
2) Is it possible to get a stack dump from this when it happens? I know on Windows Ctrl+Break will dump the current threads to stdout.
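For reference, a quick way to answer the sub-second question is to probe File.lastModified() directly. This is a hypothetical stand-alone snippet (not part of the Fan code base); on file systems that only store whole-second timestamps the two values typically come back identical, which would explain the RepoTest flakiness:

```java
import java.io.File;
import java.io.FileWriter;

public class TimestampProbe
{
    public static void main(String[] args) throws Exception
    {
        File f = File.createTempFile("tsprobe", ".txt");
        long first = write(f, "a");
        Thread.sleep(50);                       // well under one second
        long second = write(f, "b");
        System.out.println("first  = " + first);
        System.out.println("second = " + second);
        System.out.println(second > first
            ? "timestamps differ - sub-second resolution (or enough time elapsed)"
            : "timestamps identical - likely whole-second resolution");
        f.delete();
    }

    // Rewrite the file and return the new lastModified value.
    static long write(File f, String text) throws Exception
    {
        FileWriter out = new FileWriter(f);
        out.write(text);
        out.close();
        return f.lastModified();
    }
}
```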
freddy33 Mon 14 Sep 2009
1) Good point, I'll try. 2) Yep, I should have done the "kill -3" before :)
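As an aside, besides kill -3 / Ctrl+Break, Java 6 can also report deadlocks programmatically through the management API. A minimal, hypothetical helper (not part of Fan) would look like:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockProbe
{
    // Print any threads the JVM considers deadlocked on object monitors.
    public static void dump()
    {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();        // null when no deadlock exists
        if (ids == null) { System.out.println("no deadlock detected"); return; }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE))
            System.out.println(info);
    }
}
```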
Here is the interesting part of the dump:
Found one Java-level deadlock:
=============================
"ThreadPool-Worker-101":
waiting to lock monitor 0x00000001009ad858 (object 0x0000000106d40c38, a fanx.util.ThreadPool),
which is held by "Thread-0"
"Thread-0":
waiting to lock monitor 0x000000010093f0b0 (object 0x0000000106377c38, a fanx.util.ThreadPool$Worker),
which is held by "ThreadPool-Worker-60"
"ThreadPool-Worker-60":
waiting to lock monitor 0x00000001009ad858 (object 0x0000000106d40c38, a fanx.util.ThreadPool),
which is held by "Thread-0"
Java stack information for the threads listed above:
===================================================
"ThreadPool-Worker-101":
at fanx.util.ThreadPool.ready(ThreadPool.java:167)
waiting to lock <0x0000000106d40c38> (a fanx.util.ThreadPool)
at fanx.util.ThreadPool$Worker.run(ThreadPool.java:273)
locked <0x000000010651e998> (a fanx.util.ThreadPool$Worker)
"Thread-0":
at fanx.util.ThreadPool$Worker.run(ThreadPool.java:292)
waiting to lock <0x0000000106377c38> (a fanx.util.ThreadPool$Worker)
at fanx.util.ThreadPool.submit(ThreadPool.java:141)
locked <0x0000000106d40c38> (a fanx.util.ThreadPool)
at fan.sys.ActorPool.submit(ActorPool.java:100)
at fan.sys.Actor._enqueue(Actor.java:169)
locked <0x0000000106db2d58> (a java.lang.Object)
at fan.sys.Actor._send(Actor.java:146)
at fan.sys.Actor.send(Actor.java:86)
at fan.testSys.ActorTest$testOrdering$6.doCall(ActorTest.fan:82)
at fan.testSys.ActorTest$testOrdering$6.call(ActorTest.fan:82)
at fan.sys.FanInt.times(FanInt.java:485)
at fan.testSys.ActorTest.testOrdering(ActorTest.fan:82)
at fan.RunTestOrdering_0.RunTestOrdering.main(/work/fan/fan/src/testSys/RunTestOrdering.fan:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at fan.sys.Method.invoke(Method.java:536)
at fan.sys.Method$MethodFunc.callList(Method.java:182)
at fan.sys.Method.callList(Method.java:147)
at fanx.tools.Fan.callMain(Fan.java:136)
at fanx.tools.Fan.executeFile(Fan.java:89)
at fanx.tools.Fan.execute(Fan.java:35)
at fanx.tools.Fan.run(Fan.java:241)
at fanx.tools.Fan.main(Fan.java:279)
"ThreadPool-Worker-60":
at fanx.util.ThreadPool.ready(ThreadPool.java:167)
waiting to lock <0x0000000106d40c38> (a fanx.util.ThreadPool)
at fanx.util.ThreadPool$Worker.run(ThreadPool.java:273)
locked <0x0000000106377c38> (a fanx.util.ThreadPool$Worker)
Found 1 deadlock.
Heap par new generation total 18624K, used 15011K [0x0000000105b10000, 0x0000000106fd0000, 0x0000000106fd0000)
eden space 16000K, 78% used [0x0000000105b10000, 0x000000010674c8a0, 0x0000000106ab0000)
from space 2624K, 94% used [0x0000000106d40000, 0x0000000106fac440, 0x0000000106fd0000)
to space 2624K, 0% used [0x0000000106ab0000, 0x0000000106ab0000, 0x0000000106d40000)
concurrent mark-sweep generation total 62656K, used 12843K [0x0000000106fd0000, 0x000000010ad00000, 0x000000010af10000)
concurrent-mark-sweep perm gen total 21248K, used 11169K [0x000000010af10000, 0x000000010c3d0000, 0x0000000110310000)
brian Mon 14 Sep 2009
Promoted to ticket #743 and assigned to brian
The deadlock is definitely a bug; it looks like there must be some case where I am grabbing locks in different orders.
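From the dump, Thread-0 holds the ThreadPool monitor (inside submit) while waiting on a Worker monitor, and Worker-60 holds its Worker monitor (inside run) while waiting on the ThreadPool monitor in ready - a classic lock-order inversion. A boiled-down sketch of that shape (hypothetical code, not the actual ThreadPool) and of the usual remedy:

```java
public class LockOrderSketch
{
    static final Object poolLock   = new Object();
    static final Object workerLock = new Object();

    // Roughly Thread-0's path: hold the pool, then try to reach into a worker.
    static void submit()
    {
        synchronized (poolLock)
        {
            synchronized (workerLock) { /* hand work to the worker */ }
        }
    }

    // Roughly the worker's path: hold itself, then call back into the pool (ready()).
    static void workerLoop()
    {
        synchronized (workerLock)
        {
            synchronized (poolLock) { /* add self to the pool's idle list */ }
        }
    }

    // The usual fix: pick one global order (e.g. pool before worker) and never
    // acquire the monitors the other way around, or shrink the critical sections
    // so only one monitor is ever held at a time.
}
```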
freddy33 Tue 15 Sep 2009
I think I found the issue. I hope I can show you a patch in a couple of hours. Did you already fix it?
freddy33 Tue 15 Sep 2009
OK, I have http://hg.jfrog.org/hg/fan-1.0/rev/5c9e97b58d0d that solves the deadlock; now checking on the file timestamp issue.
freddy33 Tue 15 Sep 2009
Some more issues with the deadlock (stop was stalling), and minor timing issues in tests. Here: http://hg.jfrog.org/hg/fan-1.0/rev/e0171e1d34d2 and http://hg.jfrog.org/hg/fan-1.0/rev/41bf5cda9457
Hope it helps.
brian Tue 15 Sep 2009
freddy - thanks a lot for looking into this issue!
I was trying to stay away from java.util.concurrent and sticking to primitives to make Fan easy to port to embedded devices. But I'll take a look at your code (probably tomorrow morning).
freddy33 Wed 16 Sep 2009
OK, that explains it :)
So, I was thinking about this:
An embedded environment will require a fixed number of threads that should not die. In that case, if the ActorPool is fixed, java.util.concurrent can easily be avoided.
In a server environment, we just migrated our whole code base to pure java.util.concurrent (no more synchronized), and it really saved the day (performance and stability).
Do you think that having a symbol defining which Actor implementation the user wishes to use is a good idea? I'll have a shot at it (symbol + util.concurrent implementation), and measure the impact (code and performance). You'll see if it makes sense.
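For what it's worth, a java.util.concurrent-backed variant can stay very small. This is only a sketch of the general shape (hypothetical class, not the proposed patch): a fixed pool of long-lived threads fed from a single queue, which also matches the embedded constraint of a fixed thread count:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentPoolSketch
{
    private final ExecutorService exec;

    public ConcurrentPoolSketch(int maxThreads)
    {
        // Fixed number of worker threads that never die until stop() is called.
        exec = Executors.newFixedThreadPool(maxThreads);
    }

    public void submit(Runnable work) { exec.execute(work); }

    public void stop() throws InterruptedException
    {
        exec.shutdown();
        exec.awaitTermination(30, TimeUnit.SECONDS);
    }
}
```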
brian Wed 16 Sep 2009
I doubt there is a huge difference b/w the two (I know the Atomic stuff does use a special instruction on x86 processors though). Actually using concurrent is probably ok, I've seen it ported to older VMs.
However, for future work I'd like to keep generics out of the Java code base (since that makes it impossible to port Fan to J2ME).
I'm going to take a look at the patches this morning.
BTW: looking at your patches - were you getting a failure in DateTimeTest?
brian Wed 16 Sep 2009
Freddy
Here is my fix to RepoTest, which basically just keeps rewriting the file until we see a modified timestamp change: changeset
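The general shape of that retry is roughly the sketch below (hypothetical helper code, not the actual RepoTest change): keep rewriting until lastModified actually moves, which sidesteps whole-second timestamp resolution.

```java
import java.io.File;
import java.io.FileWriter;

public class RewriteUntilChanged
{
    // Keep rewriting the file until the reported lastModified differs from the
    // starting value, so coarse-grained (whole-second) file systems still end
    // up with a visibly newer timestamp.
    static void rewriteUntilModified(File f, String text) throws Exception
    {
        long before = f.lastModified();
        while (true)
        {
            FileWriter out = new FileWriter(f);
            out.write(text);
            out.close();
            if (f.lastModified() != before) return;
            Thread.sleep(10);
        }
    }
}
```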
Can you verify that fix?
freddy33 Wed 16 Sep 2009
I have the same doubt; that's why I want to truly check the impact (depending on the number of cores - 2, 4, 8 - and CPU cycles). For DateTimeTest, in the line I changed, dt2 was equal to dt1 (250ms tolerance Duration) and so DateTime.nowTicks was greater than dt2.ticks.
freddy33 Wed 16 Sep 2009
RepoTest is OK for me now, thanks.
brian Wed 16 Sep 2009
RepoTest is OK for me now, thanks.
Great - thanks for helping out!
For DateTimeTest in the line I changed, the dt2 was equal to dt1 (250ms tolerance Duration) and so DateTime.nowTicks was greater than dt2.ticks.
OK, that makes perfect sense. I am surprised we've never seen that before. I think the fix is slightly more complicated, b/c Time.now always uses the cached version - so I changed the code around a bit: changeset
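To illustrate why dt2 could equal dt1, here is a tiny sketch of the caching idea (hypothetical code, not the sys::DateTime implementation): with a tolerance, two calls inside the window return the exact same cached instant.

```java
public class CachedNowSketch
{
    private static volatile long cachedMillis;

    // Return a cached timestamp while the previous one is still within tolerance.
    public static long now(long toleranceMillis)
    {
        long current = System.currentTimeMillis();
        long cached = cachedMillis;
        if (cached != 0 && current - cached <= toleranceMillis)
            return cached;              // second call within 250ms: dt2 == dt1
        cachedMillis = current;
        return current;
    }
}
```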
Let me know if that doesn't fix your test failures.
I'm still digging into the deadlock, trying to re-immerse myself in that code.
BTW: how many cores is your machine?
freddy33 Wed 16 Sep 2009
Great, checking it now. I have 2 cores on the Mac with Snow Leopard, 4 on the desktop with Ubuntu, and 8 with CentOS on the test machine. All 64-bit. So, I'll set up automated tests of hg on each one, and measure the perf of the Actor implementation(s).
brian Wed 16 Sep 2009
Using the AtomicReference code I see a significant slow down in some of the tests - so I think something might be off there. For example testOrdering went from 1.7sec to 7sec.
I came up with another fix, although it had a negative impact on performance too (not too bad though). But I guess for now we should make sure we have correct code; then we can optimize later.
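One possible (unverified) explanation for the slowdown: a lock-free structure built on AtomicReference retries failed compareAndSet calls in a loop, so under heavy contention the threads busy-spin instead of parking on a monitor. A generic sketch of that retry pattern (hypothetical, not the patched code):

```java
import java.util.concurrent.atomic.AtomicReference;

class CasPushSketch
{
    static class Node
    {
        final Runnable work;
        final Node next;
        Node(Runnable work, Node next) { this.work = work; this.next = next; }
    }

    private final AtomicReference<Node> head = new AtomicReference<Node>();

    // Push with a compare-and-set loop: every failed CAS is another spin.
    void push(Runnable work)
    {
        while (true)
        {
            Node cur = head.get();
            if (head.compareAndSet(cur, new Node(work, cur))) return;
        }
    }
}
```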
freddy33 Sat 19 Sep 2009
I just pushed a slightly cleaner version (no generics and no atomics) that does not deadlock, but it still takes 7 secs :(
freddy33 Sat 19 Sep 2009
So, I managed to remove the need to sync around the Worker.work modifications, and I think I found the root of the issue:
a worker that had already been declared idle could enter the idle state again.
So, I enforce a removal from the idle list before checking for pending tasks.
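In sketch form, the ordering being described is (hypothetical names, not the real Worker/ThreadPool code): take yourself off the idle list first, only then look at the pending queue, and only go idle if there really is nothing to do.

```java
import java.util.LinkedList;

class PoolSketch
{
    static class Worker { }

    private final LinkedList<Worker>   idle    = new LinkedList<Worker>();
    private final LinkedList<Runnable> pending = new LinkedList<Runnable>();

    synchronized Runnable nextWork(Worker w)
    {
        idle.remove(w);                  // 1) make sure we are no longer listed as idle
        if (!pending.isEmpty())          // 2) only then check for queued work
            return pending.removeFirst();
        idle.add(w);                     // 3) nothing pending: register as idle exactly once
        return null;
    }
}
```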
I'm checking this final version on 8 cores now.
brian Sun 20 Sep 2009
I tried your latest patches - I am still seeing prolonged timing problems (6sec for testOrdering) and long delays in testLocals.
The fix that seems to work the best is to always have the Worker pull from the ThreadPool's work queue so that only one lock is needed (but it is a bit slower than the current design, and still has some kinks in shutdown).
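Roughly, that single-lock design looks like the sketch below (hypothetical code, not the actual patch): all hand-off goes through the pool's own monitor, so there is never a second lock whose order could be inverted.

```java
import java.util.LinkedList;

class SingleLockPoolSketch
{
    private final LinkedList<Runnable> queue = new LinkedList<Runnable>();
    private boolean stopped;

    synchronized void submit(Runnable work)
    {
        queue.addLast(work);
        notify();                        // wake one waiting worker
    }

    // Each worker thread loops on this; null means the pool has been stopped.
    synchronized Runnable take() throws InterruptedException
    {
        while (queue.isEmpty() && !stopped) wait();
        return queue.isEmpty() ? null : queue.removeFirst();
    }

    synchronized void stop()
    {
        stopped = true;
        notifyAll();                     // release every blocked worker
    }
}
```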
brian Mon 21 Sep 2009
Freddy,
OK - I think I have a patch for the ThreadPool deadlock which passes the test suite and doesn't introduce any significant performance problems.
changeset
Can you test that out on your various platforms and souped up 8-core monster?
Thanks!
freddy33 Mon 21 Sep 2009
BRAVO!
Success on all platforms:
Mac OS X 10.6, 2 cores, java version "1.6.0_15"
CentOS, 8 cores, java version "1.6.0_07"
Debian, 8 cores, java version "1.6.0_12"
I'll keep checking and maybe set up a Hudson for it :)
brian Mon 21 Sep 2009
Ticket resolved in 1.0.47
Great - thanks a lot for helping out so much!