A simple idea for data cleanup on Geni

Started by Private User on Friday, May 6, 2011

Showing all 28 posts
Private User
5/6/2011 at 2:31 PM

So there's a major data entry problem on Geni, and you can recreate it.

Pick a random person on your tree born in the 1800's or before who doesn't have a birth date or parents listed. The person should be marked as deceased. Add a father, and Geni will default the father to deceased, but then add a mother, and Geni will default that person to living!

I see SO many of my collaborators make this mistake (I even found a zombie of my own that I had entered as living in the 1700's), so here's my suggestion.

I suggest that Geni engineers write a one time script to crawl through the Geni data with the goal of marking everyone who was born in the 1800's and earlier as dead. For reference, here is the list of living supercentenarians:

http://en.wikipedia.org/wiki/List_of_living_supercentenarians

With only 85 people in the world alive over the age of 110, the script would make up to 85 mistakes at worst, which I think is reasonable... so the script would work something like this:

* Anyone born before 1900 is marked as dead, with the exception of an obviously mis-typed person: one who has 5 or more immediate siblings / spouses / children born AFTER 1900
* Anyone who has more than 5 siblings / spouses / children / parents / aunts / uncles / brothers- and sisters-in-law / first cousins born BEFORE 1800 is also marked as dead
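The two rules above could be sketched roughly as follows. This is only an illustration, not anything Geni actually runs: the function name and arguments are made up, and applying the second rule only to profiles with no birth year is an assumption on top of what the post says.

```python
from typing import Iterable, Optional


def should_mark_deceased(
    birth_year: Optional[int],
    immediate_birth_years: Iterable[Optional[int]],
    extended_birth_years: Iterable[Optional[int]],
) -> bool:
    """Hypothetical helper: decide whether a living profile looks like a zombie.

    immediate_birth_years: birth years of siblings / spouses / children.
    extended_birth_years:  birth years of the wider circle (parents, aunts,
                           uncles, in-laws, first cousins, and so on).
    """
    if birth_year is not None:
        if birth_year >= 1900:
            return False
        # Rule 1: born before 1900 means deceased, unless 5 or more close
        # relatives were born after 1900 (the birth year is probably a typo).
        born_after_1900 = sum(
            1 for year in immediate_birth_years if year is not None and year > 1900
        )
        return born_after_1900 < 5

    # Rule 2 (assumed to apply to undated profiles only): more than 5
    # relatives born before 1800 means the profile is deceased.
    born_before_1800 = sum(
        1 for year in extended_birth_years if year is not None and year < 1800
    )
    return born_before_1800 > 5
```

For example, a profile born in 1850 with five children all born after 1900 is left alone as a probable typo, while an undated profile with six relatives born in the 1700's is marked deceased.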

And once the algorithm finds the majority of people this way, then a more broad approach:

* Start with families for which a LOT of data exists... for example, a family with more than a dozen birth dates among parents/children/grandchildren in the 1900's, a family that clearly lived in the last 100 years. Now take generational steps back and, based on the ages of the parents the algorithm picked, assume 35 years per generation and mark all the grandparents, great-grandparents, etc. as deceased.

* You can also do the reverse... pick a family where there is a LOT of data that firmly cements the family in the 1700's, assume a HUGE generation gap, like 40 years, and iterate down until you're within 40 years of the 1900's and stop... marking everyone dead, all the way down.
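The generational arithmetic in these two steps can be sketched like this. The function names are hypothetical; the 35 and 40 year gaps and the 1900 cutoff come from the post, and everything else is an assumption about how one might code it.

```python
def estimated_ancestor_birth_year(
    anchor_birth_year: int, generations_back: int, years_per_generation: int = 35
) -> int:
    """Estimate an ancestor's birth year by stepping back from a dated person."""
    return anchor_birth_year - generations_back * years_per_generation


def generations_safe_to_mark(
    anchor_birth_year: int, years_per_generation: int = 40, cutoff_year: int = 1900
) -> int:
    """Count how many descendant generations can be marked deceased.

    Starting from a family anchored in the past, step down one generation
    at a time and stop before the estimated birth year comes within one
    generation gap of the cutoff year.
    """
    generations = 0
    estimate = anchor_birth_year
    while estimate + years_per_generation <= cutoff_year - years_per_generation:
        estimate += years_per_generation
        generations += 1
    return generations
```

For a family anchored at 1720 this walks down through estimated birth years 1760, 1800 and 1840, then stops before 1880, which is within 40 years of 1900.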

Anyways, that is the gist. I have marked SO many individuals from the 1800's and 1700's as dead for myself and for collaborators that I just think there is a need for this sort of script. I think these profiles are disruptive to merges, as one of the key things the match algorithm looks at is whether or not the person is alive, and it rarely auto-matches a living person with a dead one. I bet this process would nab 500,000 profiles that fit this description. I've found and fixed easily 1,000 in my five months on Geni.

I guess I'm saying, I just didn't realize how bad the problem was until I found more on my own tree yesterday.

Private User
5/6/2011 at 7:33 PM

Stephen -

True!

This has been bandied about a few times. I'm not sure to what end. I'll draw this to the attention of one of the tech-types.

Private User
5/6/2011 at 11:38 PM

Thanks!

Yea, I recognize the tree is huge, and that there is probably more to it than meets the eye, but I think this would be a very valuable use of time for all around data cleanup.

Steve

Private User
5/7/2011 at 12:12 AM

Certainly.

I sent Mike a link to this thread. He said he'd take a look. So, at least this will be read and someone might be in touch.

Minimally, the mother defaulting to deceased should be a simple fix. Even preventing those profiles from slipping through the cracks helps.

5/7/2011 at 12:15 AM

Stephen and Private User,
what Geni really needs is a basic "exception report"
that any user can see on their "homepage". This would include "too old" people (a person older than X), people who became parents younger than XX or older than YY, tree loops, and many others that can be added later. Any user would be able to define his own exception values.
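A minimal sketch of what such a report's checks might look like, with user-defined limits. All names and default values here are invented for illustration, and tree loops are left out because they need a walk over the whole tree rather than a per-profile check.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Thresholds:
    """User-adjustable limits (the X, XX and YY of the post); defaults are guesses."""
    max_living_age: int = 110
    min_parent_age: int = 14
    max_parent_age: int = 60


def exception_report(
    birth_year: Optional[int],
    living: bool,
    child_birth_years: Sequence[int],
    current_year: int,
    limits: Thresholds,
) -> List[str]:
    """Return warning codes for one profile; undated profiles yield nothing."""
    if birth_year is None:
        return []
    warnings = []
    if living and current_year - birth_year > limits.max_living_age:
        warnings.append("too_old")
    for child_year in child_birth_years:
        age_at_birth = child_year - birth_year
        if age_at_birth < limits.min_parent_age:
            warnings.append("parent_too_young")
        elif age_at_birth > limits.max_parent_age:
            warnings.append("parent_too_old")
    return warnings
```

Note that a check like this says nothing about profiles without a birth year, which is the gap the rest of this discussion is about.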

Private User
5/7/2011 at 3:05 AM

Any Pro can use advanced search to search for living profiles born before any given date, and can filter by Connected to You, Managed by You, etc.

Private User
5/7/2011 at 3:30 AM

Yes but I'm talking about profiles for whom no birth date is known.

Yaacov, yes that is done a number of ways on Geni already, but perhaps no way to search for them... profiles older than 125 years (both birth and death years) show up with a question mark next to their name. Also, when you type those people in, you get an immediate prompt saying how old they are.

5/7/2011 at 7:22 AM

Thanks for the report. If I'm understanding you, the problem is the default value for Living / Deceased when using "Add This Person" where the existing spouse and children are deceased. It seems to me that we should just default it to the same as the existing spouse, no?

I think you'll find that if you were to include a birth year when adding the wife, you'll get a warning if you try to make her living and > 125 years old. So I think the problem is when adding the mother and no birth date. We tried once before marking everyone older than 125 deceased, but it misses a great many profiles that have no birth date. The 5-generation trick is exactly what we do with the "zombie script" that any curator can run -- just post a link to the profile to the "Zombies please" discussion: http://www.geni.com/discussions/76010
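The proposed default and the existing 125-year warning could be sketched as below. The function names are hypothetical and this is not Geni's code; falling back to "living" when there is no spouse to copy from is an assumption.

```python
from typing import Optional


def default_living(existing_spouse_living: Optional[bool]) -> bool:
    """Default the new person's status to match the existing spouse.

    Assumption: with no existing spouse (None), fall back to "living".
    """
    if existing_spouse_living is None:
        return True
    return existing_spouse_living


def needs_age_warning(
    birth_year: Optional[int], living: bool, current_year: int, max_age: int = 125
) -> bool:
    """Warn when a profile is marked living but would be older than max_age.

    A missing birth year never triggers the warning, which is exactly how
    undated profiles slip through.
    """
    if birth_year is None:
        return False
    return living and current_year - birth_year > max_age
```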

5/7/2011 at 11:15 AM

Michael - there are several ways to deal with it.
One of them: don't assume any living status, and force the user to choose one.
About the reports I mentioned earlier: they can be added to the Statistics area so anyone can clearly see the warnings for his tree.

5/7/2011 at 3:37 PM

===
the problem is the default value for Living / Deceased when using "Add This Person" where the existing spouse and children are deceased.
===

Yes, I believe that is the scenario I've run across.

The problem with the zombie script, in this situation, is someone "on the veil of privacy" (i.e. with a claimed profile within the 5 generation rule).

In my own family, unless I specifically re-set to "public / deceased," that could be 1823 ... and *not* what my family wishes (they want those profiles "public," not private).

Private User
5/7/2011 at 5:30 PM

Yaacov Glezer, I really like the idea of conflicts, warnings, incomplete locations, etc. being shown in 'Statistics' or some other similar location.

Have you initiated a 'feature request' yet in the 'Geni Help?' http://help.geni.com/forums/337266-feature-requests

5/7/2011 at 7:53 PM

Private User, in the past I wrote up some ideas including this one, but I got a "cold answer", so when it's relevant I prefer to post my ideas in the discussions, hoping someone with "connections" will see them and pass them on to Geni.

Private User
5/8/2011 at 2:02 AM

Erica, yes exactly. I'm talking about a one time script that crawls through the Geni tree and marks people who lived in the 1800's and earlier as dead, in both public and private profiles. Since the extreme majority of profiles that exist on Geni are people who have lived in the past 300 years, MANY of those people (likely 70+% based on generational statistics) are in private trees. I'm saying mark these hundreds of thousands of profiles as dead because they are causing problems with the match suggestion algorithm. And of course they were simply entered by mistake when originally typed in.

5/8/2011 at 2:11 AM

I think it was mentioned in the past that if Geni were to do the one time script they would have to take the site down for a day or two, so that wasn't going to happen.

Private User
5/8/2011 at 4:32 AM

Maybe we could have a vote - all those in favour of a one or two day downtime to run the script for the common good for the future? Would the site really have to be taken down to run it? Perhaps only bits would need to be taken down at any one time?

Stephen asserted: "one of the key things the match algorithm looks at is whether or not the person is alive or not, and rarely auto-matches a living person with a dead one"

Mike Stangel: Is that correct?

Mike: If I am not mistaken, I recall you recently saying that the tree match suggestion algorithm ignores the living status because so many are wrong. So Stephen's observation that dead profiles marked live are causing problems with the match suggestion algorithm should not be the case? Perhaps we could try a controlled experiment: find a suggested match, change the living status of one of the profiles and see if the match is no longer suggested. Change it back and see if the match is now suggested again.

LOL David,
Geni has a couple million actual users. How many votes do you think a poll will get?

5/8/2011 at 9:08 AM

I believe the search match algorithm ignores living/deceased, but I'd have to dig up the details to say for sure.

I've been quiet on this because I thought the problem wasn't widespread enough to be alarming; we have the zombie script for profiles with 5 generations of descendants, and we generally stay out of people's private trees. I ran a query to see how many profiles we have that are marked living but born before 1886 -- the answer is 36,427. I've taken the liberty of marking them all deceased, and I created a ticket to fix the "Add This Person" bug.
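As a rough illustration of that kind of bulk fix, here is a sketch over an in-memory list of profiles. The dictionary keys and the function name are assumptions; the real query presumably ran against Geni's database. The 1886 cutoff is 2011 minus 125, which matches the 125-year limit mentioned earlier in the thread.

```python
from typing import Dict, List


def mark_old_living_deceased(profiles: List[Dict], cutoff_year: int = 1886) -> int:
    """Mark every living profile born before cutoff_year as deceased.

    Returns the number of profiles changed. Profiles without a birth
    year are skipped, so undated zombies are not caught by this pass.
    """
    changed = 0
    for profile in profiles:
        birth_year = profile.get("birth_year")
        if profile.get("living") and birth_year is not None and birth_year < cutoff_year:
            profile["living"] = False
            changed += 1
    return changed
```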

5/8/2011 at 1:58 PM

Mike

Excellent! Thank you very much.

We may get some "issues" but I think those can be resolved on a case by case basis if Geni Customer Service is alert and aware.

Dealing with 36K profiles though thru query tools etc is far more efficient than one at a time research ....

Private User
5/8/2011 at 7:59 PM

Mike, that is really awesome! But that still only nabbed those profiles with birth dates, right?

So you're saying anyone with 5 generations of descendants listed is also nabbed? Perhaps you could expand that zombie script to include anyone who might not have 5 generations of descendants of their own, but has a sibling who has 5 generations of descendants as well? I think this would get the extreme majority of those that remain.

Also, perhaps anyone with a first cousin who has 6 generations of descendants? In my observation, the profiles most likely to remain as zombies are "off the beaten path": almost always, for example, the siblings of a great-great-great-grandparent, or perhaps the children of a great-great-great-grandparent's siblings. And may I point out that this is also where the MAJORITY of merges happen: two distant segments of families who are related, finding each other through lists of third and fourth great-grandparents' siblings. Families that are often in different countries or states, and have never known that each other exists...

At least, this is my experience on Geni. I've found SO many sets of long lost relatives whom some Geni member is a descendant of.

But also Mike, yes, I am quite certain that I have seen the match algorithm fail to match a living person with a dead person. For example, when I'm sending merge requests for a whole family, one person sometimes doesn't show up as matched. When I am certain the whole family is a match, I sometimes have to change that person's living status, request the merge, and then often change the living status back if I'm certain of it.

I've seen this many times. With the new privacy settings, when you're trying to send a complete set of merge requests to another user, you can often see that a profile you're merging with has a sibling that matches your sibling, and sometimes changing the living status helps the algorithm see that it is indeed a match.

Thanks so much for your help!

Steve

5/8/2011 at 8:26 PM

Agreed - it wasn't hundreds of thousands, but it wasn't hundreds either...

Private User
5/8/2011 at 9:16 PM

I would bet money that it's hundreds of thousands if the zombie detection script was modified to include people whose siblings have 5 generations of descendants, and people with first cousins who have five generations of descendants!

I know that might be more difficult, but I appreciate help, on a Sunday no less!

5/8/2011 at 9:19 PM

You should try running the zombie script a little higher in the tree -- if it detects a descendant branch of, say, 7 generations then it will process profiles within radius 2 in any direction. Find someone with 10 generations of descendants, and it will process everyone with radius 5.
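Going only by the two examples given here (7 generations gives radius 2, 10 generations gives radius 5), the radius looks like the number of descendant generations minus 5. The sketch below assumes that relationship and a simple breadth-first walk over a relationship graph; the actual zombie script may work differently, and all names are made up.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set


def zombie_radius(descendant_generations: int, base_generations: int = 5) -> int:
    """Radius inferred from the examples in the post: generations minus 5."""
    return max(0, descendant_generations - base_generations)


def profiles_within(
    graph: Dict[Hashable, Iterable[Hashable]], start: Hashable, radius: int
) -> Set[Hashable]:
    """Collect every profile at most `radius` relationship steps from `start`.

    `graph` maps a profile id to the ids of directly related profiles
    (parents, children, spouses), in any direction.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == radius:
            continue  # do not expand past the radius
        for neighbour in graph.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return set(distance)
```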

5/8/2011 at 10:52 PM

2 points:
- I have seen the match algorithm match a living and a dead person.
- In the horizontal-ancestry-tree form, the rule used to be that the default for the 2 leftmost generations was "living", and the default for the 3 rightmost generations was "dead". This is of course bogus when the root of the tree was born in the 1700s - setting the default to "dead" when the root is either born prior to 1900 or has > 4 levels of descendants would fix this issue (if it's not already fixed).

Private User
5/9/2011 at 2:41 AM

Good idea Harald.

Mike, So I admit I'm completely new to the site, as I've only been here a few months, but are you saying there is a zombie script that I can personally run for myself and my collaborators/family group members? If so, how do I do this?

Steve

Steve,
the zombie script is NOT available to all users. BUT if you have a specific profile that you suspect is a zombie or has zombies near it, feel free to post a link to it in the following discussion (you might want to bookmark or follow it) and a Curator will take care of your request ASAP.

"Zombies Please" discussion: http://www.geni.com/discussions/76010
Curators: http://wiki.geni.com/index.php/Curators

Private User
5/9/2011 at 2:39 PM

Ohhhhh, I seee!!! I thought the zombie thread was for dealing with individual Zombies, I had no idea it did a multiple generation radius and nabs every zombie in the vicinity. This is REALLY good to know!

The next time I come across a collaborator's tree with a hundred zombies, I'll just mention it there instead of fixing them all.

Private User
5/29/2011 at 4:16 PM

So there is a flaw with the Zombie script. The zombie script is genius, but it will never be able to accurately find Zombies for areas of the tree that do not have descendants down to the "present day", and with so many trees petering out with people who lived in the 1800's, those people will never be identified as zombies by the zombie script.

Zombie script explanation from Mike in an above post: "You should try running the zombie script a little higher in the tree -- if it detects a descendant branch of, say, 7 generations then it will process profiles within radius 2 in any direction. Find someone with 10 generations of descendants, and it will process everyone with radius 5."

Alright, another new idea for data cleanup....

So the last few days I've fixed another 500 zombies manually, because they have no birth date or death date. Simply marking all people born before 1886 as deceased found and dealt with 36,427 zombies (thanks Mike Stangel!), and my next idea can knock out the rest. I really do think this will get 95% of all remaining zombies, and I do think it's well over a quarter million in total. I fix more than 5 a day every day at this point, so there are a TON left out there in the wild.

Script 1 - For anyone born before 1850, mark all of their spouses without birthdates as deceased, and then immediately run script 2 on these people.

Script 2 - Would mark all of the ancestors of a profile as deceased (out to great-great-grandparents and great-great aunts/uncles). The profiles script 2 would act on would primarily be profiles without birthdates that are mathematically determined to be deceased (like in Script 1).

Script 3 - For any profile born before 1910, run script 2 on them, which would mark their parents and great aunts and uncles as deceased where those profiles are without birth or death years.
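The three scripts could be sketched like this. The Profile class and all function names are hypothetical, and the sketch assumes each profile carries direct links to its parents, siblings and spouses; it only ever touches profiles that are marked living and have no birth or death year.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Profile:
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    living: bool = True
    parents: List["Profile"] = field(default_factory=list)
    siblings: List["Profile"] = field(default_factory=list)
    spouses: List["Profile"] = field(default_factory=list)


def mark_if_undated(profile: Profile) -> bool:
    """Mark deceased only if living with no birth or death year; report a change."""
    if profile.living and profile.birth_year is None and profile.death_year is None:
        profile.living = False
        return True
    return False


def script2(profile: Profile, max_generations: int = 4) -> int:
    """Mark undated ancestors out to great-great-grandparents (4 generations),
    plus aunts/uncles out to great-great aunts/uncles (siblings of the first
    3 ancestor generations). Returns the number of profiles changed."""
    changed = 0
    level = [profile]
    for generation in range(1, max_generations + 1):
        level = [parent for person in level for parent in person.parents]
        for ancestor in level:
            changed += mark_if_undated(ancestor)
            if generation < max_generations:
                changed += sum(mark_if_undated(s) for s in ancestor.siblings)
    return changed


def script1(profile: Profile) -> int:
    """For anyone born before 1850: mark undated spouses, then run script 2 on them."""
    if profile.birth_year is None or profile.birth_year >= 1850:
        return 0
    changed = 0
    for spouse in profile.spouses:
        if spouse.birth_year is None:
            changed += mark_if_undated(spouse)
            changed += script2(spouse)
    return changed


def script3(profile: Profile) -> int:
    """For anyone born before 1910: run script 2 on them."""
    if profile.birth_year is None or profile.birth_year >= 1910:
        return 0
    return script2(profile)
```

Because dated profiles are never modified, a relative with a recorded birth year keeps whatever status the user gave them.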

I recognize that running scripts across the whole tree is a big task, but my scripts would only act on profiles that have no birth or death years and are marked as alive. These scripts also need to run within private trees. I find a large number of my collaborators kind of "give up" on Geni after they realize that, in the Gedcom upload they did years ago, Geni interpreted all of their ancestors without birth or death dates as "alive". When a person is confronted with 1,000-10,000 profiles they have to manually mark as deceased, one at a time, through a flash pane on a website, they just give up and stop using Geni.

So I think the quality of Geni's data would be very well served by this, especially in that zone where private tree meets public (1600-1850), where I would bet money that a quarter of a million+ zombies exist, mostly in private trees.

As I said, I have been manually discovering and fixing more than 5 a day, every day since I joined Geni.

Private User
5/29/2011 at 4:48 PM

Identifying, and slaying, undated spouses of pre-1850's (now deceased - thanks Mike Stangel) is the next logical step. I like it.
