Update EM-DAT Transformer For Accurate Disaster Data

by Admin
**Update EM-DAT Transformer for HIPs 2025: A Comprehensive Guide**

Hey everyone, let's dive into an important update for the EM-DAT transformer in the pystac-monty project! This update focuses on ensuring our disaster data is accurate, interoperable, and aligned with the latest standards, specifically the UNDRR-ISC 2025 codes. This guide will walk you through the current state, the required changes, testing procedures, and the expected outcomes.

Current State of the EM-DAT Transformer: The Challenge of Classification

The EM-DAT transformer currently stores the raw EM-DAT classification key (for example, nat-hyd-flo-flo) as its only hazard code; see line 262 of pystac_monty/sources/emdat.py. The key is never mapped to a UNDRR-ISC 2025 code, so the data isn't easily compatible with other systems and standards that use those codes. It also carries no GLIDE code, which is crucial for cross-referencing disaster events.

Current Implementation: A Quick Look

monty.hazard_codes = [row.classif_key]  # Line 262

As you can see, the current implementation is pretty straightforward. It grabs the EM-DAT classification key, but that's where it stops. It doesn't:

  • Map to UNDRR-ISC 2025 codes.
  • Include GLIDE codes for broader interoperability.
  • Leverage the taxonomy cross-classification mapping table effectively.

This means that users have to do extra work to understand and utilize the data effectively. This update aims to fix that!

Reference Mapping: Linking EM-DAT to the UNDRR-ISC 2025 Standard

The key to unlocking the full potential of this data lies in mapping the EM-DAT classifications to UNDRR-ISC 2025 codes. According to the EM-DAT analysis documentation, EM-DAT uses the EM-DAT CRED Classification Tree, and its keys should be mapped to UNDRR-ISC 2025 codes, which serve as the reference classification for the Monty extension. GLIDE codes should be included as well for maximum interoperability. The complete mapping table, covering all EM-DAT disaster types, is available in the taxonomy documentation.

Example Mappings from Documentation: A Glimpse

Let's look at some examples to illustrate the mapping:

| EM-DAT Classification Key | GLIDE | UNDRR-ISC 2025 (Reference) | Cluster | Description |
|---|---|---|---|---|
| nat-hyd-flo-flo | FL | MH0600 | MH-WATER | Flooding (chapeau) |
| nat-geo-ear-gro | EQ | GH0101 | GEO-SEIS | Earthquake |
| nat-met-sto-tro | TC | MH0306 | MH-WIND | Cyclone or Depression |
| nat-cli-dro-dro | DR | MH0401 | MH-PRECIP | Drought |
| nat-geo-vol-vol | VO | GH0201 | GEO-VOLC | Lava Flows |
| nat-geo-ear-tsu | TS | MH0705 | MH-MARINE | Tsunami |
| nat-cli-wil-for | WF | EN0205 | ENV-FOREST | Wildfires |
| nat-met-ext-hea | HT | MH0501 | MH-TEMP | Heatwave |
| nat-met-ext-col | CW | MH0502 | MH-TEMP | Cold Wave |
| nat-geo-mmd-lan | LS | GH0300 | GEO-GFAIL | Gravitational Mass Movement |

As you can see, the mapping provides a bridge between EM-DAT's classification system, the GLIDE codes, and the standardized UNDRR-ISC 2025 codes. It allows for a more complete understanding and usage of the data.
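To make the idea concrete, a few of the example rows can be held in a small lookup table. This is only a sketch for illustration: the `HazardMapping` dataclass, the `EXAMPLE_MAPPINGS` dict, and the `lookup()` helper are hypothetical names and are not part of pystac-monty. Only four rows from the examples are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HazardMapping:
    """One row of the example mappings (hypothetical helper type)."""
    glide: str
    undrr_2025: str
    cluster: str
    description: str


# A few rows copied from the example mappings; the full table lives in
# the taxonomy cross-classification mapping.
EXAMPLE_MAPPINGS = {
    "nat-hyd-flo-flo": HazardMapping("FL", "MH0600", "MH-WATER", "Flooding (chapeau)"),
    "nat-geo-ear-gro": HazardMapping("EQ", "GH0101", "GEO-SEIS", "Earthquake"),
    "nat-met-sto-tro": HazardMapping("TC", "MH0306", "MH-WIND", "Cyclone or Depression"),
    "nat-cli-dro-dro": HazardMapping("DR", "MH0401", "MH-PRECIP", "Drought"),
}


def lookup(classif_key: str) -> Optional[HazardMapping]:
    """Return the example mapping for an EM-DAT key, or None if it is not listed."""
    return EXAMPLE_MAPPINGS.get(classif_key)


flood = lookup("nat-hyd-flo-flo")
print(flood.undrr_2025, flood.glide)  # MH0600 FL
```

Keying the table by the EM-DAT classification key mirrors how the transformer starts from `row.classif_key` and works outward to the other two code systems.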

Important Notes:

  • Interoperability: All three classification codes (GLIDE, EM-DAT, and UNDRR-ISC 2025) are essential for maximum interoperability. This ensures data can be easily shared and used across various platforms.
  • Comprehensive Mapping: The complete mapping table with all EM-DAT disaster types is available in the taxonomy cross-classification mapping. This table is the key to converting the EM-DAT data.
  • EM-DAT's Significance: EM-DAT has the most comprehensive classification tree, which makes a complete mapping implementation a priority.

Required Changes: Implementing the Transformation

The good news is that we can leverage existing resources! The EM-DAT classification keys are already integrated into the cross-classification system within HazardProfiles.csv. This means we can directly use the get_canonical_hazard_codes() function (from issue #111) without needing a custom mapping. This will streamline the process and make our code cleaner.

File to Modify: Where the Magic Happens

  • pystac_monty/sources/emdat.py: This is the file where we'll be making the necessary changes.

Changes Needed: Step-by-Step

Here's what needs to be done:

  1. Update the make_source_event_item() method (around line 262):
    • Keep the initial assignment: monty.hazard_codes = [row.classif_key]
    • Add a call to get_canonical_hazard_codes() to derive the full trio of codes.
      • This function automatically looks up the EM-DAT key in HazardProfiles.csv, derives the corresponding UNDRR 2025 code, and finds the associated GLIDE code if available.
      • It then returns the complete trio: [UNDRR 2025, EM-DAT, GLIDE]
  2. Add keyword generation using get_keywords():
    • Generate human-readable keywords for STAC discoverability. This will include hazard keywords and country names, to enhance search capabilities.
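The lookup described in step 1 can be sketched roughly as follows. This is not the real `get_canonical_hazard_codes()` implementation: the column names in `SAMPLE_CSV` are assumptions rather than the actual HazardProfiles.csv schema, `load_table()` and `canonical_trio()` are hypothetical helpers, and the last CSV row is made up purely to show the "GLIDE if available" branch.

```python
import csv
import io

# Hypothetical stand-in for a cross-classification table. The column names
# are assumptions for this sketch, not the real HazardProfiles.csv schema.
# The last row is invented to demonstrate a key with no GLIDE code.
SAMPLE_CSV = """undrr_2025,emdat_key,glide
MH0600,nat-hyd-flo-flo,FL
GH0101,nat-geo-ear-gro,EQ
XX0000,example-key-without-glide,
"""


def load_table(text: str) -> dict:
    """Index the CSV rows by EM-DAT classification key."""
    return {row["emdat_key"]: row for row in csv.DictReader(io.StringIO(text))}


def canonical_trio(emdat_key: str, table: dict) -> list:
    """Return [UNDRR 2025, EM-DAT, GLIDE], omitting GLIDE when the row has none."""
    row = table[emdat_key]
    codes = [row["undrr_2025"], emdat_key]
    if row["glide"]:
        codes.append(row["glide"])
    return codes


table = load_table(SAMPLE_CSV)
print(canonical_trio("nat-hyd-flo-flo", table))  # ['MH0600', 'nat-hyd-flo-flo', 'FL']
```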

Example Update: Seeing the Changes in Action

To help you visualize the changes, let's compare the current code with the updated version:

Before (Current Code): The Baseline

def make_source_event_item(self, row: EmdatDataValidator) -> Optional[Item]:
    # ... geometry and datetime creation ...

    # Add Monty extension
    MontyExtension.add_to(item)
    monty = MontyExtension.ext(item)
    monty.episode_number = 1  # EM-DAT doesn't have episodes
    monty.hazard_codes = [row.classif_key]  # CURRENT: Only EM-DAT code
    monty.country_codes = [row.iso] if row.iso else []

    monty.compute_and_set_correlation_id()
    # ... rest of method ...

After (Updated Code): The Transformation

def make_source_event_item(self, row: EmdatDataValidator) -> Optional[Item]:
    # ... geometry and datetime creation ...

    # Add Monty extension
    MontyExtension.add_to(item)
    monty = MontyExtension.ext(item)
    monty.episode_number = 1  # EM-DAT doesn't have episodes

    # Set initial hazard code from EM-DAT
    monty.hazard_codes = [row.classif_key]
    monty.country_codes = [row.iso] if row.iso else []

    # NEW: Normalize to canonical trio (UNDRR 2025 + EM-DAT + GLIDE)
    monty.hazard_codes = self.hazard_profiles.get_canonical_hazard_codes(item)

    # NEW: Generate keywords for discoverability
    hazard_keywords = self.hazard_profiles.get_keywords(monty.hazard_codes)
    country_keywords = [row.country] if row.country else []
    item.properties["keywords"] = list(set(hazard_keywords + country_keywords))

    monty.compute_and_set_correlation_id()
    # ... rest of method ...

The updated code calls get_canonical_hazard_codes() to replace the single EM-DAT key with the complete trio of hazard codes, which is the major enhancement. It also adds a keywords property built from the hazard keywords and the country name.
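One detail worth keeping in mind: `list(set(...))` removes duplicates but does not guarantee a stable order, so the keywords list can come out in a different order from one run to the next. If stable output matters (for snapshot-style tests, say), an order-preserving de-duplication is one option. The helper name `merge_keywords` is hypothetical, and the keyword values below are made-up examples.

```python
def merge_keywords(*groups):
    """Concatenate keyword lists, dropping duplicates but keeping first-seen order."""
    merged = []
    for group in groups:
        merged.extend(group)
    # dicts preserve insertion order (Python 3.7+), so this de-duplicates stably.
    return list(dict.fromkeys(merged))


hazard_keywords = ["Flood", "Flooding"]
country_keywords = ["Spain", "Flood"]
print(merge_keywords(hazard_keywords, country_keywords))
# ['Flood', 'Flooding', 'Spain']
```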

Testing: Ensuring Accuracy and Interoperability

Testing is crucial to ensure that our changes are working correctly and that the data is accurate. We'll implement unit tests, integration tests, and manual verification to cover all bases.

Unit Tests to Add/Update: Focusing on Specific Functionality

  1. Test event item generation with mapped codes:

    def test_emdat_event_item_has_all_codes():
        transformer = EMDATTransformer(...)
    
        # Create mock row with flood classification
        row = EmdatDataValidator(classif_key="nat-hyd-flo-flo", ...)
        event_item = transformer.make_source_event_item(row)
    
        monty = MontyExtension.ext(event_item)
        assert monty.hazard_codes == ["MH0600", "nat-hyd-flo-flo", "FL"]
    

This unit test will confirm that when an event item is generated with a specific EM-DAT classification (e.g., flood), the resulting monty.hazard_codes contains all three codes: UNDRR 2025, EM-DAT, and GLIDE.

Integration Testing: Testing with Real Data

  1. Test with real EM-DAT Excel/JSON data:
    • Process the complete EM-DAT dataset extract.
    • Verify all classification keys are mapped correctly.
    • Check that generated STAC items have the correct codes.
  2. Test backward compatibility:
    • Ensure that queries using EM-DAT codes still work.
    • Verify that queries using GLIDE codes work as expected.
    • Confirm that queries using UNDRR-ISC 2025 codes function correctly.
  3. Test hazard and impact items:
    • Confirm that hazard items have the correct codes.
    • Verify that impact items inherit the correct codes from the parent event.
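For the "all classification keys are mapped" check, a small coverage report can help before running the full transformer. The sketch below assumes you have already pulled the classification keys out of the extract into a list and have the set of mapped keys at hand. `find_unmapped()` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter


def find_unmapped(classif_keys, mapped_keys):
    """Count occurrences of each classification key that has no mapping."""
    return Counter(key for key in classif_keys if key not in mapped_keys)


# Made-up sample: keys as they might appear in a data extract.
keys_in_extract = ["nat-hyd-flo-flo", "nat-geo-ear-gro", "nat-hyd-flo-flo", "made-up-key"]
mapped = {"nat-hyd-flo-flo", "nat-geo-ear-gro"}

unmapped = find_unmapped(keys_in_extract, mapped)
print(dict(unmapped))  # {'made-up-key': 1}
```

Counting occurrences rather than just listing the keys shows how many records each gap affects, which helps decide which gaps to fix first.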

Manual Verification: A Hands-On Approach

  1. Download a recent EM-DAT data export.
  2. Process the data through the transformer.
  3. Inspect the STAC items for various disaster types.
  4. Verify the monty:hazard_codes arrays for accuracy.
  5. Test search and filter operations using different code types to ensure they function as expected.
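Steps 4 and 5 can be partly scripted. The sketch below filters plain STAC item dictionaries by any one of the three code types. It assumes the items have been loaded as JSON dicts with the codes under `properties["monty:hazard_codes"]`. The `items_with_code()` helper and the item ids are hypothetical.

```python
def items_with_code(items, code):
    """Return ids of items whose monty:hazard_codes array contains `code`."""
    return [
        item["id"]
        for item in items
        if code in item.get("properties", {}).get("monty:hazard_codes", [])
    ]


# Made-up items shaped like STAC item JSON (ids are illustrative only).
items = [
    {"id": "event-a", "properties": {"monty:hazard_codes": ["MH0600", "nat-hyd-flo-flo", "FL"]}},
    {"id": "event-b", "properties": {"monty:hazard_codes": ["GH0101", "nat-geo-ear-gro", "EQ"]}},
]

# The same event should be found through any of its three codes.
for code in ("MH0600", "nat-hyd-flo-flo", "FL"):
    print(code, items_with_code(items, code))
```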

Acceptance Criteria: Defining Success

To ensure this update meets our goals, we have specific acceptance criteria:

  • [ ] Complete mapping implemented for all EM-DAT disaster classifications.
  • [ ] All natural disasters (Hydrological, Meteorological, Climatological, Geophysical, Biological) are mapped.
  • [ ] Technological disasters are mapped.
  • [ ] Societal disasters are mapped.
  • [ ] Event items contain all three code types (2025, EM-DAT, GLIDE).
  • [ ] Hazard items use the correct classification codes.
  • [ ] Impact items correctly inherit codes from events.
  • [ ] Unit tests pass for all major EM-DAT categories.
  • [ ] Integration tests verify the correct STAC generation from real data.
  • [ ] Unmapped classifications are handled gracefully with proper logging.
  • [ ] Code includes comprehensive comments and documentation.
  • [ ] Mapping matches the official taxonomy cross-classification table.

Meeting these criteria shows that the update is thorough and that the resulting data is high-quality and interoperable across different platforms and applications.
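The criterion on unmapped classifications could be met in more than one way. One possible approach, sketched below, is to keep the original EM-DAT key and log a warning when no mapping is found. The fallback behavior, the `codes_or_fallback()` helper, and the shape of the `mapping` dict are assumptions for illustration, not the project's confirmed design.

```python
import logging

logger = logging.getLogger("emdat_sketch")


def codes_or_fallback(classif_key, mapping):
    """Return mapped codes, or only the EM-DAT key (with a warning) if unmapped."""
    mapped = mapping.get(classif_key)
    if mapped is None:
        logger.warning("No cross-classification mapping for EM-DAT key %r", classif_key)
        return [classif_key]
    undrr, glide = mapped
    return [undrr, classif_key, glide]


# Hypothetical mapping: EM-DAT key -> (UNDRR-ISC 2025 code, GLIDE code).
mapping = {"nat-hyd-flo-flo": ("MH0600", "FL")}
print(codes_or_fallback("nat-hyd-flo-flo", mapping))  # ['MH0600', 'nat-hyd-flo-flo', 'FL']
print(codes_or_fallback("made-up-key", mapping))      # ['made-up-key']
```

Falling back to the original key means the item is still produced and still searchable by its EM-DAT code, while the warning makes the gap visible.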

Estimated Effort: Time to Completion

We estimate that this work will take around 6-8 hours. The EM-DAT classification tree is large, with over 50 classification keys covering a wide range of disaster types, but reusing existing resources keeps the implementation itself straightforward.

This update will significantly improve the accuracy, interoperability, and usefulness of the EM-DAT data within the pystac-monty project, making it easier for everyone to access and utilize this critical disaster information. Good luck, and happy coding, guys!