ThreadLocalRandom in Java 7
2015-07-03 16:36
This article is reposted from: http://mabusyao.iteye.com/blog/1362826
This morning I came across a post about the usage of ThreadLocalRandom in Java 7, which claims it is twice as fast as Math.random(). I am reposting it here to learn from it: When I first wrote this blog my intention was to introduce you to the class ThreadLocalRandom, which is new in Java 7, for generating random numbers. I have analyzed the performance of ThreadLocalRandom in a series of micro-benchmarks to find out how it performs in a single-threaded environment. The results were relatively surprising: although the code is very similar, ThreadLocalRandom is twice as fast as Math.random()! The results drew my interest and I decided to investigate this a little further. I have documented my analysis process. It is an exemplary introduction to the analysis steps, technologies, and some of the JVM diagnostic tools required to understand differences in the performance of small code segments. Some experience with the described toolset and technologies will enable you to write faster Java code for your specific HotSpot target environment.
OK, that's enough talk, let's get started!
Math.random() works on a static singleton instance of Random, whilst ThreadLocalRandom.current().nextDouble() works on a thread-local instance of ThreadLocalRandom, which is a subclass of Random. ThreadLocal introduces the overhead of a variable lookup on each call to the current() method. Considering what I've just said, it's really a little surprising that it's twice as fast as Math.random() in a single thread, isn't it? I didn't expect such a significant difference.
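For illustration, the two call patterns compared here look like this side by side (a minimal, self-contained sketch; the class name is mine):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomUsage {
    public static void main(String[] args) {
        // Math.random() delegates to a single java.util.Random instance
        // shared by all threads.
        double a = Math.random();

        // ThreadLocalRandom keeps one generator per thread; current()
        // returns the instance bound to the calling thread.
        double b = ThreadLocalRandom.current().nextDouble();

        // Both produce a double in the range [0.0, 1.0).
        System.out.println(a + " " + b);
    }
}
```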
Again, I am using the tiny micro-benchmarking framework presented in one of Heinz's blogs. The framework that Heinz developed takes care of several challenges in benchmarking Java programs on modern JVMs, including warm-up, garbage collection, the accuracy of Java's time API, verification of test accuracy, and so forth.
Here are my runnable benchmark classes:
```java
public class ThreadLocalRandomGenerator implements BenchmarkRunnable {

    private double r;

    @Override
    public void run() {
        r = r + ThreadLocalRandom.current().nextDouble();
    }

    public double getR() {
        return r;
    }

    @Override
    public Object getResult() {
        return r;
    }
}

public class MathRandomGenerator implements BenchmarkRunnable {

    private double r;

    @Override
    public void run() {
        r = r + Math.random();
    }

    public double getR() {
        return r;
    }

    @Override
    public Object getResult() {
        return r;
    }
}
```
Let's run the benchmark using Heinz' framework:
```java
public class FirstBenchmark {

    private static List<BenchmarkRunnable> benchmarkTargets = Arrays.asList(
            new MathRandomGenerator(),
            new ThreadLocalRandomGenerator());

    public static void main(String[] args) {
        DecimalFormat df = new DecimalFormat("#.##");
        for (BenchmarkRunnable runnable : benchmarkTargets) {
            Average average = new PerformanceHarness().calculatePerf(
                    new PerformanceChecker(1000, runnable), 5);
            System.out.println("Benchmark target: " + runnable.getClass().getSimpleName());
            System.out.println("Mean execution count: " + df.format(average.mean()));
            System.out.println("Standard deviation: " + df.format(average.stddev()));
            System.out.println("To avoid dead code optimization: " + runnable.getResult());
        }
    }
}
```
Notice: To make sure the JVM does not identify the code as "dead code", I accumulate into a field variable and print out the result of my benchmarking immediately. That's why my runnable classes implement an interface called BenchmarkRunnable.
I am running this benchmark three times. The first run is in default mode, with inlining and JIT optimization enabled:
```
Benchmark target: MathRandomGenerator
Mean execution count: 14773594,4
Standard deviation: 180484,9
To avoid dead code optimization: 6.4005410634212025E7
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 29861911,6
Standard deviation: 723934,46
To avoid dead code optimization: 1.0155096190946539E8
```
Then again without JIT optimization (VM option -Xint):
```
Benchmark target: MathRandomGenerator
Mean execution count: 963226,2
Standard deviation: 5009,28
To avoid dead code optimization: 3296912.509302683
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 1093147,4
Standard deviation: 491,15
To avoid dead code optimization: 3811259.7334526842
```
The last test is with JIT optimization, but with -XX:MaxInlineSize=0 which (almost) disables inlining:
```
Benchmark target: MathRandomGenerator
Mean execution count: 13789245
Standard deviation: 200390,59
To avoid dead code optimization: 4.802723374491231E7
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 24009159,8
Standard deviation: 149222,7
To avoid dead code optimization: 8.378231170741305E7
```
Let's interpret the results carefully: With full JVM JIT optimization, ThreadLocalRandom is twice as fast as Math.random(). With JIT optimization turned off, the two perform equally well (or equally badly). Method inlining seems to account for about 30% of the performance difference; the remaining difference may be due to other optimization techniques.
One reason why the JIT compiler can tune ThreadLocalRandom more effectively is the simpler implementation of ThreadLocalRandom.next():
```java
public class Random implements java.io.Serializable {
    ...
    protected int next(int bits) {
        long oldseed, nextseed;
        AtomicLong seed = this.seed;
        do {
            oldseed = seed.get();
            nextseed = (oldseed * multiplier + addend) & mask;
        } while (!seed.compareAndSet(oldseed, nextseed));
        return (int) (nextseed >>> (48 - bits));
    }
    ...
}

public class ThreadLocalRandom extends Random {
    ...
    protected int next(int bits) {
        rnd = (rnd * multiplier + addend) & mask;
        return (int) (rnd >>> (48 - bits));
    }
    ...
}
```
The first snippet shows Random.next(), which is used intensively in the Math.random() benchmark. Compared to ThreadLocalRandom.next(), the method requires significantly more instructions, although both methods do the same thing. In the Random class, the seed variable stores state shared globally across all threads, and it changes with every call to the next() method. Therefore an AtomicLong is required to safely access and change the seed value during calls to nextDouble(). ThreadLocalRandom, on the other hand, is - well - thread local :-) Its next() method does not have to be thread safe and can use an ordinary long variable as the seed value.
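The difference between the two seed-update strategies can be isolated in a small sketch. This is my own illustrative class, not the JDK source; only the LCG constants are copied from java.util.Random:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SeedUpdate {
    // LCG parameters as used by java.util.Random.
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private final AtomicLong sharedSeed = new AtomicLong(42);
    private long localSeed = 42;

    // Random-style: the seed is shared, so it must be advanced with a
    // CAS retry loop to stay thread safe.
    int nextShared(int bits) {
        long oldseed, nextseed;
        do {
            oldseed = sharedSeed.get();
            nextseed = (oldseed * MULTIPLIER + ADDEND) & MASK;
        } while (!sharedSeed.compareAndSet(oldseed, nextseed));
        return (int) (nextseed >>> (48 - bits));
    }

    // ThreadLocalRandom-style: the seed is confined to one thread, so a
    // plain field assignment is enough.
    int nextLocal(int bits) {
        localSeed = (localSeed * MULTIPLIER + ADDEND) & MASK;
        return (int) (localSeed >>> (48 - bits));
    }

    public static void main(String[] args) {
        SeedUpdate s = new SeedUpdate();
        // Starting from the same seed, both variants produce the same
        // sequence; only the synchronization cost differs.
        System.out.println(s.nextShared(32) == s.nextLocal(32)); // prints "true"
    }
}
```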
About method inlining and ThreadLocalRandom
One very effective JIT optimization is method inlining. In frequently executed hot paths, the HotSpot compiler decides to inline the code of a called method (the child method) into the calling method (the parent method). "Inlining has important benefits. It dramatically reduces the dynamic frequency of method invocations, which saves the time needed to perform those method invocations. But even more importantly, inlining produces much larger blocks of code for the optimizer to work on. This creates a situation that significantly increases the effectiveness of traditional compiler optimizations, overcoming a major obstacle to increased Java programming language performance."
Since Java 7 you can monitor method inlining by using diagnostic JVM options. Running the code with '-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining' will show the inlining efforts of the JIT compiler. Here are the relevant sections of the output for the Math.random() benchmark:
```
@ 13   java.util.Random::nextDouble (24 bytes)
@ 3    java.util.Random::next (47 bytes)   callee is too large
@ 13   java.util.Random::next (47 bytes)   callee is too large
```
The JIT compiler cannot inline the Random.next() method that is called in Random.nextDouble(). This is the inlining output for ThreadLocalRandom.next():
```
@ 8    java.util.Random::nextDouble (24 bytes)
@ 3    java.util.concurrent.ThreadLocalRandom::next (31 bytes)
@ 13   java.util.concurrent.ThreadLocalRandom::next (31 bytes)
```
Because the next() method is shorter (31 bytes), it can be inlined. Since the next() method is called intensively in both benchmarks, this log suggests that method inlining may be one reason why ThreadLocalRandom performs significantly faster. To verify that, and to find out more, we need to dive into the assembly code. With Java 7 JDKs it is possible to print the generated assembly code to the console; see here for how to enable the -XX:+PrintAssembly VM option. The option prints out the JIT-optimized code, which means you can see the code the JVM actually executes. I have copied the relevant assembly code into the links below.
Assembly code of ThreadLocalRandomGenerator.run() here.
Assembly code of MathRandomGenerator.run() here.
Assembly code of Random.next() called by Math.random() here.
Assembly code is machine-specific, low-level code; it's more complicated to read than bytecode.
Let's try to verify that method inlining has a relevant effect on performance in my benchmarks, and check whether there are other obvious differences in how the JIT compiler treats ThreadLocalRandom and Math.random(). In ThreadLocalRandomGenerator.run() there is no procedure call to any of the subroutines like Random.nextDouble() or ThreadLocalRandom.next(). There is only one virtual (hence expensive) method call to ThreadLocal.get() visible (see line 35 in the ThreadLocalRandomGenerator.run() assembly). All the other code is inlined into ThreadLocalRandomGenerator.run(). In the case of MathRandomGenerator.run() there are two virtual method calls to Random.next() (see block B4, line 204 ff. in the assembly code of MathRandomGenerator.run()). This confirms our suspicion that method inlining is one important root cause of the performance difference. Furthermore, due to the synchronization overhead, considerably more (and some expensive!) assembly instructions are required in Random.next(), which is also counterproductive in terms of execution speed.
Understanding the overhead of the invokevirtual instruction
So why is (virtual) method invocation expensive, and method inlining so effective? The operand of an invokevirtual instruction is not an offset to a concrete method in a class instance. The compiler does not know the internal layout of a class instance. Instead, it generates symbolic references to the methods of an instance, which are stored in the runtime constant pool. Those runtime constant pool items are resolved at run time to determine the actual method location. This dynamic (run-time) binding requires verification, preparation, and resolution, which can considerably affect performance. (See Invoking Methods and Linking in the JVM Spec for details.)
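As a hypothetical illustration (the class and method names here are made up), compare a call site that javac compiles to invokevirtual with one it compiles to invokestatic:

```java
public class DispatchDemo {
    static class Doubler {
        // Instance methods are virtual by default in Java.
        int twice(int x) { return x * 2; }
        // Static methods need no receiver lookup.
        static int twiceStatic(int x) { return x * 2; }
    }

    public static void main(String[] args) {
        Doubler d = new Doubler();
        // javac emits invokevirtual here: the target is a symbolic reference
        // in the runtime constant pool, resolved at run time (and possibly
        // devirtualized and inlined by HotSpot if the site stays monomorphic).
        int a = d.twice(21);
        // invokestatic: the target is fixed at link time, independent of any
        // receiver object.
        int b = Doubler.twiceStatic(21);
        System.out.println(a == b); // prints "true"
    }
}
```

You can confirm the emitted instructions with `javap -c DispatchDemo`.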
That's all for now. The disclaimer: of course, the list of topics you need to understand to solve performance riddles is endless. There is a lot more to understand than micro-benchmarking, JIT optimization, method inlining, Java bytecode, assembly language, and so forth. Also, there are a lot more root causes for performance differences than just virtual method calls or expensive thread-synchronization instructions. However, I think the topics I have introduced are a good start into such deep-diving stuff. Looking forward to critical and enjoyable comments!