
ReCiter is a system for making highly accurate guesses about author identity in publication metadata. ReCiter includes a Java application, a DynamoDB-hosted database, and a set of RESTful microservices which collectively allow institutions to maintain accurate and up-to-date author publication lists for thousands of people. This software is optimized for disambiguating authorship in PubMed and, optionally, Scopus.
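For orientation, here is a minimal sketch of how an application might pull a person's suggested publications from a locally deployed ReCiter instance over HTTP. The host, port, endpoint path, and response shape are placeholders for this sketch, not ReCiter's documented API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client call against a locally deployed ReCiter instance;
// the URL below is a placeholder, not ReCiter's actual REST interface.
public class ReCiterClient {
    public static void main(String[] args) throws Exception {
        URI uri = URI.create("http://localhost:5000/reciter/publications?uid=ajs9999");
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // assumed: JSON list of suggested articles
    }
}
```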

ReCiter rapidly and accurately identifies articles written by a given person, including articles published at previous affiliations. It does this by leveraging institutionally maintained identity data (e.g., departments, relationships, email addresses, year of degree). The more complete and efficient searches that result from combining these types of data save time and make your institution more productive. If you run ReCiter daily, you can ensure that the desired users are the first to learn when a new publication appears in PubMed.

ReCiter is freely available and open source.

Functionality

Data sources

  • Institution-specific identity data (see the sketch following this list)
    • name variants (such as nicknames, name changes, and spelling irregularities)
    • current and former institutional affiliations
    • departmental or other organizational unit affiliations
    • e-mail addresses (personal, institutional, etc.)
    • degree years (bachelor's and any terminal degree)
    • grant identifiers
    • relationships (co-investigatorships, mentor/mentee, people in shared organizational unit, manager, etc.)
    • common institutional affiliations (used for everyone)
    • individual institutions (e.g., undergraduate, doctoral, residency, internship, clinical affiliation; used to limit results when someone has an especially common name)
    • institutions which frequently collaborate with your institution
  • PubMed search engine, which primarily accesses the MEDLINE database
  • Scopus (optional), a bibliographic database used to harvest affiliations. See Configuring Scopus
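As a rough illustration of the institution-specific identity data listed above, here is a hypothetical Java shape for a person's identity record. The class and field names are invented for this sketch and do not reflect ReCiter's actual schema.

```java
import java.util.List;

// Hypothetical identity record; field names are illustrative only.
public class IdentityRecord {
    String uid;                               // institutional person identifier
    List<String> nameVariants;                // nicknames, name changes, spelling irregularities
    List<String> emailAddresses;              // institutional and personal addresses
    List<String> organizationalUnits;         // departments, divisions, centers
    Integer bachelorDegreeYear;               // publications predating this are suspect
    Integer terminalDegreeYear;
    List<String> grantIdentifiers;            // e.g., sponsor award numbers
    List<String> knownRelationships;          // co-investigators, mentors, mentees, managers
    List<String> institutionalAffiliations;   // current and former institutions
}
```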

Algorithm

Some examples of reasoning used in ReCiter include:

  • Dr. Y is more likely to have written this article. It lists his department in the author affiliation field.
  • Dr. X couldn't have written this article. It was published eight years before she got her Bachelor's degree.
  • Dr. Z is more likely to have written this article. It lists two authors who are also included as co-investigators on an active grant.

Altogether, ReCiter can use up to 11 pieces of evidence for a given individual, as well as eight different features for clustering articles with one another.
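Purely as an illustration of evidence-based scoring, and not ReCiter's actual model, a toy sketch might combine signed per-evidence scores and apply a threshold. The evidence names, weights, and cutoff below are invented.

```java
import java.util.Map;

// Illustrative only: a toy way to combine per-article evidence scores into a
// single acceptance decision. The threshold and scoring scheme are invented.
public class EvidenceScorer {
    static final double ACCEPT_THRESHOLD = 3.0; // invented cutoff

    static double totalScore(Map<String, Double> evidenceScores) {
        // evidenceScores maps an evidence type (e.g., "email match",
        // "department match", "degree-year plausibility") to a signed score;
        // negative scores count against the candidate article.
        return evidenceScores.values().stream()
                .mapToDouble(Double::doubleValue)
                .sum();
    }

    static boolean accept(Map<String, Double> evidenceScores) {
        return totalScore(evidenceScores) >= ACCEPT_THRESHOLD;
    }
}
```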

The article How ReCiter works, along with the issues it links to, describes the algorithm in more detail.

Value to institutions

Whether ReCiter is useful depends upon how publication data is managed at your institution. ReCiter is certainly useful at Weill Cornell. Consider the following use case: the Department of Pediatrics tells you that the institution is about to hire a prominent new faculty member from Harvard, Dr. Andrew Schwartz, and wants an updated list of publications right away. You don't want to be responsible for a public-facing profile that is plainly incomplete.

  • You reach out to a variety of parties (the department, Dr. Schwartz, the Office of Faculty Affairs, etc.) requesting a CV, but no one gets back to you. You'll have to do this on your own.
  • The PubMed interface retrieves over 3,500 results when searching for "A Schwartz" as an author (see the query sketch after this list).
  • We can assume that most of these candidate publications are not authored by the "A Schwartz" we are looking for! Even the ones with a Harvard affiliation may have been written by another A. Schwartz.
  • You notice that sometimes this author goes by "Andy Schwartz", is inconsistent about including his middle initial, and sometimes doesn't include an affiliation.
  • You search for a LinkedIn, Google Scholar, Academia.edu, ORCID or other such profile - any one of these could be incomplete - and find a handful of publications.
  • You look through the candidate articles for clues. What's the institutional affiliation? What department is listed in the affiliation string? Is a known institutional email listed? Are there any indexed grants on which A. Schwartz is listed as an investigator?
  • You're feeling especially attentive to detail, so you check whether certain journals or co-author names from a known publication (one where Dr. Schwartz's email is indexed) are shared with other candidate publications.
  • This isn't your first rodeo, so you complete this work in about 20 minutes, but (cue dramatic organ music) at what cost?
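For a sense of the scale of that initial search, here is a small sketch that queries NCBI's public E-utilities esearch endpoint with a bare author-name search. It illustrates the manual starting point described above, not how ReCiter itself retrieves candidate articles.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Rough scale check: how many candidate articles does a bare author-name
// search return? Uses NCBI's public E-utilities esearch endpoint.
public class AuthorSearchCount {
    public static void main(String[] args) throws Exception {
        String term = "Schwartz%20A%5BAuthor%5D"; // "Schwartz A[Author]", URL-encoded
        URI uri = URI.create(
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
            + "?db=pubmed&term=" + term + "&retmax=0");
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(HttpRequest.newBuilder(uri).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
        // The XML response includes a <Count> element with the total hit count.
        System.out.println(response.body());
    }
}
```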

Properly populated with identity data, the ReCiter system can handle all this for you in one go, quickly and accurately. It thinks like a librarian and shows its work like a math teacher.

Furthermore, ReCiter is fast enough that it can be set up to run on a daily schedule for thousands of people. At Weill Cornell, we've had disambiguation services marketed to us which update publications only once a year. Academia prizes knowledge, and everyone hates to be out of the loop. Running ReCiter daily lets you offer everyone, from a faculty member's delegate to deans and provosts, near real-time knowledge of which publications have appeared in PubMed.

At Weill Cornell Medicine, we use ReCiter for all full-time faculty (n=1700) and PhD/MD-PhD students (n=650). We are looking to expand its use to PhD alumni, among others.

Accuracy

ReCiter's accuracy (an average of precision and recall as benchmarked against a human-curated gold standard) has been measured at over 95% for current full-time faculty at Weill Cornell Medicine. However, the exact accuracy for a given person depends on several factors, especially:

  • How much identity data you can provide the ReCiter algorithm
  • How common a person's name is
  • How prolific an author has been

For example, ReCiter typically performs far better for long-time faculty members with unique names and many publications under their belt than for a student with a common name and only a couple of publications. Knowing the email address a faculty member used at a prior affiliation (your Office of Faculty Affairs has this, at least at Weill Cornell) is a huge help. ReCiter will never be 100% accurate. Data on ReCiter's accuracy at Weill Cornell is available here.
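A minimal sketch of the accuracy measure described above, assuming both ReCiter's suggestions and the gold standard are available as sets of PMIDs:

```java
import java.util.HashSet;
import java.util.Set;

// Accuracy as described above: the average of precision and recall of
// suggested PMIDs against a human-curated gold standard.
public class AccuracyCheck {
    static double average(Set<String> suggested, Set<String> goldStandard) {
        Set<String> truePositives = new HashSet<>(suggested);
        truePositives.retainAll(goldStandard);
        double precision = suggested.isEmpty() ? 0
            : (double) truePositives.size() / suggested.size();
        double recall = goldStandard.isEmpty() ? 0
            : (double) truePositives.size() / goldStandard.size();
        return (precision + recall) / 2.0;
    }
}
```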

Privacy

  • Data from PubMed and Scopus about published articles is already publicly available.
  • Each institution can set its own access rules for the personnel information ReCiter uses to perform its searches. At Weill Cornell, the only non-public data we use are personal email addresses and names of mentees, and we do not expose these data.
  • ReCiter will only run for authors whose data you have populated into ReCiter.

Identifying the publications of authors at other institutions

ReCiter depends on institutionally-maintained data to make highly accurate assertions of author identity in publication metadata. The more you know about a given person, the better ReCiter will perform.

What about ORCID?

ORCID is a persistent digital identifier designed to distinguish one researcher from every other researcher. Users create an account at orcid.org and manually claim their publications. Some publishers now require that submitters include their ORCID ID, and there are efforts by institutions, especially libraries, to increase adoption. Some issues we have noticed with ORCID:

  • Less than 1% of articles that Weill Cornell cares about have an ORCID identifier indexed in PubMed for even one author.
  • There are a number of duplicate ORCID profiles.
  • The False Negative Problem: a new candidate publication appeared three months ago. Is it missing from the person's ORCID profile because he didn't get around to adding it, or because he simply didn't author it? Our administrators like to be "in the know"; ideally, we would tell them about a new authorship the day it was indexed in PubMed, so this ambiguity poses a problem.
  • Our users are notoriously overwhelmed. Absent any significant carrot or stick, getting them to maintain a profile in yet another system is an exercise in frustration. It seems trivial, but even getting them to designate the library as a delegate would likewise require a lot of persistence.
  • For reporting purposes, we attempt to track publications authored by users whom we can no longer contact and would have trouble encouraging to clean up their ORCID profiles. This includes publications by alumni and inactive faculty, individuals who would never give us proxy access to their ORCID profiles.

For these reasons, ORCID is not yet mature enough for Weill Cornell Medicine to count on as a reliable source of truth for author identity. If and when that changes and it becomes a valuable source of data, ReCiter could be modified to also include ORCID's assertions of authorship.

Going forward, our team intends to explore how ORCID could serve as a source or target for ReCiter.

Follow up

If you have questions about this tool, would like support, or would like to make a contribution to ReCiter, please use this contact info:

Citing this work

The original ReCiter algorithm may be cited as follows: Johnson SB, Bales ME, Dine D, Bakken S, Albert PJ, Weng C. Automatic generation of investigator bibliographies for institutional research networking systems. Journal of Biomedical Informatics 2014;51:8–14. Available from URL: http://dx.doi.org/10.1016/j.jbi.2014.03.013

License

The source code (Copyright 2018, Weill Cornell Medical College) is licensed under the Apache License, Version 2.0 (the "License"); you may not use ReCiter except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "as is" basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.