All Topics

#378 Equals/Compare keyword proposal

jodastephen Fri 24 Oct 2008

Re-reading the debate made me realise that we haven't defined the underlying semantic of equals - is the purpose of equals() to allow the original object to be substituted for another? Or something weaker?

This is critical, as there are some cases where the results differ, like Java's BigDecimal. There, 2.00 is not equals() to 2.0, yet they are compareTo() equal. ie. equal() is saying "you cannot replace this instance by the other, as there is a difference in state", whereas compareTo() is defining a separate, different, comparison state-space.

If we go with the hard line "must be substitutable" definition, then obviously the types must match. In fact, so must all non-derived state.

All this then gets messy, as the == operator is used by both equals() and compare(). Thus, there is great opportunity for whatever we have to be abused. And that worries me. Hence, part of me wants to remove compare() entirely from objects, forcing people to use dedicated Comparator instances. But thats not practical.

So, what is it that makes compare() safe to implement on the object itself? Well, surely it is safe if the same state-space is used for compare() as for equals(). Only then do the two align perfectly. So how can we make use of this observation?

For me, this leads to field annotations:

class Date {
  state Int year
  state Month month
  state Int dayOfMonth
  DayOfWeek dayOfWeek
}

This class has an auto-generated equals() method which compares year, month and dayOfMonth but not dayOfWeek.

But what about the type? Is that part of the state? Or do we use common base class?

I still prefer common base class. And this can be emphasied by allowing just a single class in a hierarchy to define state. This could be made more clear by documenting the class that defines state with a keyword before class:

state class Date {
  state Int year
  state Month month
  state Int day
  DayOfWeek dayOfWeek
}
class NamedDate : Date {  // cannot be defined as 'state' - compiler error if you try to
  Str name  // cannot be defined as 'state' - compiler error if you try to
}

And compare? Well that would be auto-generated too - just add a mixin:

state class Date : Comparable {
  state Int year
  state Month month
  state Int day
  DayOfWeek dayOfWeek
}

The same state keyword is used for both - ensuring the same state-space is used for equals() and compare(). The comparison is done in the order that the fields are declared - year, month then dayOfMonth.

More complicated checks can utilize abstract, once and calculated fields. These should allow some programmer interaction with the process.

The nice part about this approach is its simplicity. You cannot go wrong.

It does place restrictions on the equals/compare contracts above and beyond Java, but that I would argue is a good thing. For example, the Java BigDecimal class would not be possible in Fan - one of the two definitions of comparison would have to be chosen.

helium Fri 24 Oct 2008

There are things with a natural order like real numbers and there are things with no natural order like complex numbers. Sometiems you want to order things differently depending on the context. You can order persons by there last name, by age by monthly income whatever.

=> I don't like compare on every object, either.

There are two kinds of objects, those defined by their values and those defined by their identity. E.g. if you have to points with the same coordinates they are the same. The same is true for dates. Objects that are defined by their values should use structural comparison to determine equality. But there are objects with an identity. You can have two cars of the same brand in the same color built in the same year ... every thing you store about them is the same, yet they are two distinct cars.

The proposed state keyword works perfectly for value objects but does not for identity objects.

And state might be a bit misleading as structural comparison works perfectly for imutable objects.

I don't see how the Comparable mixin might work. How does it know how to order dates? You have to compare the fields in the right order. If the years are different you have the result, if they are the smae compare the month, ...

edit: typo

jodastephen Fri 24 Oct 2008

The proposed state keyword wourks perfectly for value objects but does not for identity objects.

The state keyword is optional. If you don't declare it, the default equals() method is inherited from Obj, which compares identity.

I don't see how the Comparable mixin might work. How does it know how to order dates?

As the text says - it matches in the order that the fields are declared. Simple really.

helium Fri 24 Oct 2008

Damn, I should learn to read.

Yet Comparble seems a bit too magical to me.

brian Fri 24 Oct 2008

Stephen,

To be clear, my understanding of your proposal is:

mark fields with some annotation (I'd prefer a facet vs keyword)
auto-generate equals if any fields are marked as such
auto-generate compare if Comparable mixin is declare
disallow developers to write their own equals or compare methods

Is that correct?

Do you guys have some large codebases you could grep to see if that would really work? For instance would that model cover every case you have in Joda Time Stephan?

The case I can think of that I know I've done in the past is to have a value type which compares by field values, then in a subclass switch it to a identity type which just compares equality via ===.

Also I think I'm leaning much more towards the idea that types have to match explicitly or else two objects are unequal.

JohnDG Fri 24 Oct 2008

I don't like this idea, because it requires data be stored in some normal form. For example, a file object might hold a string field set to "file://foo.txt", or it might hold a string field set to "foo.txt". Conceptually, these represent the same resource.

Similarly, a quadrilateral is defined by 4 points, pt1, pt2, pt3, pt4, but two quadrilaterals are equal only if they have the same points in the same order -- but not necessarily with the same labels (e.g. this.pt1 might equal that.pt2, etc.).

A rational class might store two fields, p and q, both of which are integers, but two rationals are equal only if p1/q1 == p2/q2.

Two maps might store the exact same key/value pairs, but the layout in memory might be different -- yet the maps should still be considered equal.

Requiring that users store data in normal form is unreasonable, in my opinion, and onerous enough that it simply won't be done as often as it should be.

Also I think I'm leaning much more towards the idea that types have to match explicitly or else two objects are unequal.

This is not a useful definition of equality in about 5-10% of cases. Some frameworks, moreover, require a different definition of equality. For example, take a decorator class, a proxy/stub, or other class that makes use of delegation in some way: different types should be considered equal, and if you forbid it, then people will end up using helper functions to determine equality, which will be a mess.

The proposal I suggested quite nicely handles all these cases and other "open type" cases, while still guaranteeing the equality contract.

is the purpose of equals() to allow the original object to be substituted for another?

That's too strong in general. equals() can mean whatever the user wants it to mean, strong or weak, as long as it obeys a contract that enables low-level structures like lists and maps to work correctly.

katox Fri 24 Oct 2008

it matches in the order that the fields are declared.

Adding ordering to declarative field definitions is not something I would like to see in Fan (or elsewhere ;).

jodastephen Fri 24 Oct 2008

I'm approaching this from the perspective of keeping it simple for most cases, and following Fan's declarative style.

To be clear, my understanding of your proposal is: 1. mark fields with some annotation (I'd prefer a facet vs keyword) 2. auto-generate equals if any fields are marked as such 3. auto-generate compare if Comparable mixin is declare 4. disallow developers to write their own equals or compare methods

Yes. Although I could compromise on point 4, as JohnDG has some good points about normalized data and framework definitions. However I definitely want most users to define equals declaratively - its safer for a start.

I know that there is controversy about using facets vs keywords, but I remain of the opinion that all core language concepts should use keywords.

Finally, a hashcode method will also need generating.

The case I can think of that I know I've done in the past is to have a value type which compares by field values, then in a subclass switch it to a identity type which just compares equality via ===.

That sounds weird.

Do you guys have some large codebases you could grep to see if that would really work? For instance would that model cover every case you have in Joda Time Stephan?

Hmmm. I know that it would work for the entire Java Bean style model of our application, as the equals/hashcode method is currently auto-code-generated.

I'm also pretty sure that it would work for the Joda-Time/JSR-310 style value objects. (ZonedDateTime is composed of OffsetDateTime and TimeZone, OffsetDateTime is composed of LocalDateTime and ZoneOffset, LocalDateTime is composed of LocalDate abd LocalTime, and so on - declarative equals() works very well here.)

...then people will end up using helper functions to determine equality, which will be a mess.

If that happens, we've failed. (Although I should note that there will always be helpers - Str.equals vs Str.equalsIgnoreCase)

msl Tue 19 Jan 2010

I'm not entirely sure why this needs to go so far down into the depths of the language as to need a keyword or a facet which is then acted upon by the compiler when the same functionality can be provided via library extensions such as mixins.

I love scala's const class - but that's mostly because of the automatic generation of toString, hash and equals methods rather than any of the fancy-shmancy pattern matching stuff :)

For what it's worth, I've written a pretty quick and nasty set of mixins which provide consistent hash(), equals(Obj) and toStr() based on a class's properties. It only meets a sub-set of all possible use cases but it's met all of my (toy project) needs so far.

The source for it is on github, based on code posted by brian.

I haven't done a compare() mixin - but it should be a trivial task to do so based on the existing equals mixin. I also haven't thought much about the sort of normalization and comparison that JohnDG is discussing.

jodastephen Wed 20 Jan 2010

The mixin approach is tempting, but its going to be slow, as it effectively uses reflection. As Scala shows, some kind of language feature is preferable.

BTW, I have no problem with something that looks like a mixin but actually generates real code in the subclass. This could be achieved as a compiler plugin (and would be sweet in general) something like this.

mixin State : CodeGeneratingMixin {
  Void generateCode(AstNode) { ... }
}

The state keyword that the thread starts with is a simpler approach for users however, as it allows selective choice over which properties to include in the state.

Finally, I would note that this is a feature that always comes up in cross language comparisons, and one Fantom is weak at right now.

brian Thu 21 Jan 2010

I'd be interested if someone wanted to run some tests to get performance difference b/w hardcoded versus mixin-reflection hash/equals performance.

msl Fri 22 Jan 2010

I'd be interested if someone wanted to run some tests to get performance difference b/w hardcoded versus mixin-reflection hash/equals performance.

I'd already done something along these lines for my testing my hash() generator (just committed to github).

You can read the source to pick all of the holes in my simplistic approach, but as an indicative summary:

Run 1:
Processed 25000 small normal records in 3548009ns
Processed 25000 small mixin records in 33489977ns
Processed 25000 medium normal records in 9266394ns
Processed 25000 medium mixin records in 81102928ns
Processed 25000 large normal records in 34175405ns
Processed 25000 large mixin records in 327532434ns

Run 2:
Processed 25000 small normal records in 3493471ns
Processed 25000 small mixin records in 34698311ns
Processed 25000 medium normal records in 9138741ns
Processed 25000 medium mixin records in 81708284ns
Processed 25000 large normal records in 33288827ns
Processed 25000 large mixin records in 324770242ns

Run 3:
Processed 25000 small normal records in 3610851ns
Processed 25000 small mixin records in 35824456ns
Processed 25000 medium normal records in 11521686ns
Processed 25000 medium mixin records in 98526086ns
Processed 25000 large normal records in 33528849ns
Processed 25000 large mixin records in 328527916ns

Or to summarize (excuse the dodgy ASCII art)

+--------------+----------------------+--------+
| Object Size  |  Average Time (ns)   | Factor |
| (properties) | (normal) |  (mixin)  |        |
+--------------+----------+-----------+--------+
| Small (3)    |  3550777 |  34670914 |   9.76 |
| Medium (15)  |  9975607 |  87112432 |   8.73 |
| Large (60)   | 33664360 | 326943530 |   9.71 |
+--------------+----------+-----------+--------+

So, a couple of things... There seems to be a consistent overhead of around 10x for using reflection over direct access.

Number of properties being reflected against doesn't seem to vary this overhead (small to large doesn't have a discernible impact).

Note: There are probably a bunch of holes in my code that make all of these numbers irrelevant and meaningless. You have been warned.

brian Fri 22 Jan 2010

Interesting your test gets you those numbers.

Below is the code for a test I just did. And the numbers shows that the hard-coded solution and the reflection solution have just about the same performance to populating a hash map:

Hard: 1292ms
Soft: 1466ms

Code:

class Main
{
  static Void main() {
    // warm up
    hm := Hard:Hard[:]
    sm := Soft:Soft[:]
    10000.times { h := Hard(); hm[h] = h; s := Soft(); sm[s] = s }

    // test both
    hm.clear; sm.clear; num := 100_000
    t1 := Duration.now
    num.times { h := Hard(); hm[h] = h }
    t2 := Duration.now
    num.times { s := Soft(); sm[s] = s }
    t3 := Duration.now

    echo("Hard: " + (t2-t1).toLocale)
    echo("Soft: " + (t3-t2).toLocale)
  }
}

const class Hard {
  const Int x := Int.random
  const Str y := Buf.random(20).toHex
  const Float z := Float.random
  override Int hash() { x.hash.xor(y.hash).xor(z.hash) }
  override Bool equals(Obj? o) {
    if (o isnot Hard) return false
    h := (Hard)o
    return x == h.x && y == h.y && z == h.z
  }
}

const class Soft {
  const Int x := Int.random
  const Str y := Buf.random(20).toHex
  const Float z := Float.random
  override Int hash() {
    r := 33
    typeof.fields.each |f| { r = r.xor(f.get(this).hash) }
    return r
  }
  override Bool equals(Obj? o) {
    t := o.typeof
    if (o !== t) return false
    return t.fields.all |f| { f.get(this) == f.get(o) }
  }
}

helium Fri 22 Jan 2010

What does

t := o.typeof
if (o !== t) return false

mean? o must not be the same object as its runtime type representation? That would never be true. Or am I screwing something up in my head?

brian Fri 22 Jan 2010

Good catch, it should be

t := o.typeof
if (t !== typeof) return false

Although interestingly enough in that test it doesn't really matter b/c we hardly ever got a hash collision. If I change the fields to have more collisions we got a more even distribution of both hash and equals calls:

const Int x := (0..10).random
const Str y := Buf.random(2).toHex
const Float z := 83f

Then I get the following numbers:

Hard: 1373ms
Soft: 1894ms

Still the reflection performance is pretty good compared to hard-coding the hash/equals methods.

helium Fri 22 Jan 2010

Fantoms reflection is really fast. I quickly rewrote the test to C# and got these results in three runs:

Hard: 215ms
Soft: 1141ms
Hard: 253ms
Soft: 1149ms
Hard: 217ms
Soft: 1149ms

Fantom:

Hard: 580ms
Soft: 620ms
Hard: 589ms
Soft: 627ms
Hard: 569ms
Soft: 612ms

jodastephen Fri 22 Jan 2010

But why is Fantom's reflection faster than Java/C#'s refelection? What are the bytecodes that make it faster?

I still favour a codegen approach, for two reasons:

these are core methods in performance critical sections of code, in loops and collections
having a mixin/facet able to generate code in the implementing class is useful in many, many other ways (eg. @Log might wrap logging statements around each method for entry/exit tracing, etc.)

Nevertheless, if Fantom has reflection based Equals/Hashcode and Comparable in the core distribution, it will be a major step forward.

msl Fri 22 Jan 2010

the numbers shows that the hard-coded solution and the reflection solution have just about the same performance

The only reasons I can think of for the differences in performance:

Extra null checks befor calling property.hash
Iterating an instances properties on each call (would be good if a mixin could have a once method - I'll see if I can speed this up)

I suspect 2 is slowing me down - any other thoughts?

But why is Fantom's reflection faster than Java/C#'s refelection? What are the bytecodes that make it faster?

because it's not using java/.net reflection or byte codes behind the scenes? It's using fcode interpreted by static java/.net to get that performance[1]

1 Ignore my guess and see Brian's answer below... I'll go back in my hole now :)

brian Fri 22 Jan 2010

But why is Fantom's reflection faster than Java/C#'s refelection? What are the bytecodes that make it faster?

I suspect Helium was comparing C# to Fantom on the JVM. The newer versions of HotSpot optimize reflection extremely well. Fantom reflection currently is backed by the VM's reflection (although could be optimized in the future).

Eventually I think the proper general purpose solution for this sort of code generation thing is using DSLs. But we need to do some work to make DSLs suitable for things other than expressions.

msl Mon 25 Jan 2010

Eventually I think the proper general purpose solution for this sort of code generation thing is using DSLs. But we need to do some work to make DSLs suitable for things other than expressions.

Ignoring the work that needs to be done, what would the end result of this look like? I'm struggling to think of how you'd use a DSL to give you a meaningful and simpler implementation of hash() and/or equals(Obj)

brian Mon 25 Jan 2010

Ignoring the work that needs to be done, what would the end result of this look like? I'm struggling to think of how you'd use a DSL to give you a meaningful and simpler implementation

The issue is that right now a DSL is strictly an expression. In order to support code generation DSLs, you'd need to enhance the DSL syntax to support class or slot modifiers.

KevinKelley Mon 25 Jan 2010

I've played with DSLs a little bit lately; after the bitwise-ops change, I wrote a BitSet class (packed array of longs), and a dsl to manipulate it with the old-style operators. And I ported a mini-Lisp (Stutter) to Fantom, and wrote a dsl to feed it.

For the bitset case, dsl-as-expression is fine; it's just sugar for the method calls. Trying to figure out how to make it more interesting, I keep running into wanting to add things to the containing method or class -- but at DSL-expansion time, I'm just a node down in the AST-tree-walk, with no link to my parents. And randomly changing the tree around myself would surely break things, anyway.

With the lispish toy, I seriously want to be able to affect the surrounding class -- maybe translate lisp forms into fan methods, maybe all kinds of weird/neat things. Not really sure yet how that all ought to work, or whether it even really should -- I'm just going with the theory that turing-complete means what it says. :-)

I'm not sure how this DSL stuff relates to msls equals/compare auto-generation; I guess they both involve modifying the compiled product, under programmer control, but outside the "normal" fantom source language.