Annual Meeting of the NCI Cohort Consortium (Abstract Submission): Submission #13

Submission information
Submission Number: 13
Submission ID: 127583
Submission UUID: 370c5d24-f379-4b85-9d20-69afbb707912

Created: Fri, 09/13/2024 - 16:50
Completed: Fri, 09/13/2024 - 16:55
Changed: Mon, 09/16/2024 - 16:40

Lightning Talks Abstract
Martin
Lajous
Faculty-Researcher
MD, ScD
Instituto Nacional de Salud Publica
An Efficient Pipeline-Based Geocoding Approach to Handle Self-Reported Addresses in a Large Population-based Cancer Cohort in Mexico
  1. First Name: Alejandro
    Last Name: Molina-Villegas
    Degree(s): PhD
    Organization: CONAHCyT-CentroGeo
  2. First Name: Karla
    Last Name: Valdez-Trejo
    Degree(s): MS
    Organization: Instituto Nacional de Salud Publica
  3. First Name: Pablo
    Last Name: Lopez-Ramires
    Degree(s): PhD
    Organization: CentroGeo
  4. First Name: Alberto
    Last Name: Simpser
    Degree(s): PhD
    Organization: ITAM
  5. First Name: Adrian
    Last Name: Cortes-Valencia
    Degree(s): MS
    Organization: Instituto Nacional de Salud Publica
  6. First Name: Dalia
    Last Name: Stern
    Degree(s): PhD
    Organization: CONAHCyT-Instituto Nacional de Salud Publica
  7. First Name: Karla
    Last Name: Cervantes-Martinez
    Degree(s): PhD
    Organization: Instituto Nacional de Salud Publica
  8. First Name: Liliana
    Last Name: Gomez-Flores-Ramos
    Degree(s): PhD
    Organization: Instituto Nacional de Salud Publica
Background. Geocoding participants’ addresses in epidemiologic cohorts is now highly accurate in high-income countries. In resource-limited settings, non-standardized address notation, the lack of address registries, and limited geocoding resources remain important challenges. We aimed to develop an efficient pipeline-based geocoding approach to handle self-reported addresses from participants in a cancer cohort in Mexico, assess the validity of coordinate assignment, and maximize geocoding success.

Methods. We obtained self-reported addresses at baseline in 2006-2008 from 104,003 participants in the Mexican Teachers’ Cohort (n=115,275). After cleaning and standardization, we optimized processing times by splitting the data (651,668 candidate coordinates) and creating 105 Amazon Web Services (AWS) virtual machines to submit queries asynchronously to the ArcGIS REST API. We conducted geospatial verification by projecting candidate coordinates, through a spatial join operation, onto Mexico’s official neighborhood vector shapefile. We compared the self-reported and API-derived addresses using string alignment scoring metrics. To assess the accuracy of the procedure, we compared the assigned address coordinates to residential block-centroid coordinates available in the 2006 national voting registry database.
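
The verification and scoring steps can be illustrated with a minimal sketch. It is not the pipeline used in the study. It assumes a ray-casting point-in-polygon test as a stand-in for the spatial join on the neighborhood shapefile, and Python's `difflib.SequenceMatcher` as a stand-in for the string alignment scoring metrics. The function names, the candidate dictionary layout, and the `min_score` threshold are all hypothetical.

```python
from difflib import SequenceMatcher
import unicodedata


def normalize(address: str) -> str:
    """Lowercase, strip accents, and collapse whitespace before comparison."""
    decomposed = unicodedata.normalize("NFKD", address)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def similarity(reported: str, returned: str) -> float:
    """Similarity ratio in [0, 1] between two normalized address strings."""
    return SequenceMatcher(None, normalize(reported), normalize(returned)).ratio()


def point_in_polygon(lon: float, lat: float, polygon) -> bool:
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def best_candidate(reported, candidates, neighborhood, min_score=0.6):
    """Keep candidates inside the neighborhood polygon; return the best match or None.

    `min_score` is an illustrative threshold, not a value from the study.
    """
    verified = [
        c for c in candidates
        if point_in_polygon(c["lon"], c["lat"], neighborhood)
    ]
    scored = [(similarity(reported, c["address"]), c) for c in verified]
    scored = [(score, c) for score, c in scored if score >= min_score]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

In this sketch a candidate is accepted only if it both falls inside the expected neighborhood polygon and resembles the self-reported address closely enough. A production pipeline would more likely use a GIS library for the spatial join.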

Results. After discarding invalid coordinates and conducting geospatial verification and similarity scoring, we assigned coordinates to 101,704 study participants. When we compared assigned coordinates to voting registry block-centroid coordinates for 81,270 participants, the median distance between coordinates was 0.17 km (interquartile range, 0.06-0.77 km). We increased geocoding coverage to 111,299 (97%) study participants by assigning voting registry-defined coordinates to 9,595 participants without a valid address.
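
A distance comparison of this kind can be sketched as follows. The abstract does not state how distances were computed, so the haversine great-circle formula and the inclusive quartile method are assumptions, and the function names are hypothetical.

```python
import math
import statistics


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def summarize_distances(pairs):
    """Median and quartiles of distances between paired coordinates.

    `pairs` is a list of ((lat, lon), (lat, lon)) tuples: the geocoded point
    and the block centroid for the same participant. Needs at least two pairs.
    """
    distances = [haversine_km(*geocoded, *centroid) for geocoded, centroid in pairs]
    q1, median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return {"median_km": median, "q1_km": q1, "q3_km": q3}
```

Applied to each participant's geocoded point and voting registry block centroid, `summarize_distances` would return a median and interquartile range in the same form as those reported above.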

Conclusions. Address-level geocoding based on self-reported addresses can be efficiently achieved in large-scale epidemiological studies in Mexico.