Author: He XingChen
Last Updated: 2026-02-12

Datasets Overview

SeisPolarity provides a unified interface for accessing and processing multiple seismic polarity datasets. All datasets are stored in HDF5 format, which keeps files compact and allows waveforms to be streamed from disk.

Available Datasets

Dataset   Source                               Size    Classes  Format
--------  -----------------------------------  ------  -------  ------
SCSN      Southern California Seismic Network  Large   U/D/N    HDF5
Txed      Texas Earthquake Dataset             Medium  U/D      HDF5
DiTing    China Earthquake Networks Center     Medium  U/D/N    HDF5
Instance  Italian seismic data (INGV)          Medium  U/D/N    HDF5
PNW       Pacific Northwest                    Large   U/D/N    HDF5

Datasets

SCSN

The Southern California Seismic Network (SCSN) dataset contains polarity-labeled seismic waveforms recorded by that network. It covers earthquake events from 2000 to 2020, and its polarity labels were annotated manually.

Warning

Dataset size: waveforms.hdf5 ~660 GB, metadata.csv ~2.2 GB. The SCSN polarity subset is ~15 GB.

Citation

Cheng, Y., Ross, Z. E., Hauksson, E., Ben-Zion, Y. (2023). Refined earthquake focal mechanism catalog for southern California derived with deep learning algorithms. Journal of Geophysical Research: Solid Earth, 128, e2022JB025975. https://doi.org/10.1029/2022JB025975

Ross, Z. E., Meier, M.-A., Hauksson, E. (2018). P wave arrival picking and first-motion polarity determination with deep learning. Journal of Geophysical Research: Solid Earth, 123, 5405-5416. https://doi.org/10.1029/2018JB015510


Txed

The Texas Earthquake Dataset (TXED) is a regional seismic signal benchmark dataset from Texas. This dataset contains a large number of earthquake events and noise waveforms, serving as an important data resource for machine learning in seismology.

Balanced Strategy

Recommended strategy: Polarity Inversion (1:1:1)

This strategy creates a balanced dataset with equal proportions of Up, Down, and Unknown samples:

  • Each Up and Down sample generates two samples (original + polarity-inverted)

  • Unknown samples are added to match the total count of (Up + Down) samples

  • Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3

from seispolarity import BalancedPolarityGenerator

generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"
)
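To make the arithmetic behind the 1:1:1 ratio concrete, here is a minimal standard-library sketch of the polarity-inversion idea. It is not the SeisPolarity implementation: `balance_by_polarity_inversion` is a hypothetical helper, waveforms are plain lists of floats, and drawing Unknown traces with replacement when too few are available is an assumption the text above does not specify.

```python
import random

UP, DOWN, UNKNOWN = 0, 1, 2  # label encoding used in this document


def balance_by_polarity_inversion(samples, seed=0):
    """Return a 1:1:1 list of (waveform, label) pairs.

    `samples` is a list of (waveform, label) pairs, where a waveform is a
    list of floats. Hypothetical helper, not part of SeisPolarity.
    """
    rng = random.Random(seed)
    up, down, unknown = [], [], []
    for waveform, label in samples:
        if label == UP:
            up.append((waveform, UP))
            # Flipping the sign of an Up trace yields a Down trace.
            down.append(([-x for x in waveform], DOWN))
        elif label == DOWN:
            down.append((waveform, DOWN))
            up.append(([-x for x in waveform], UP))
        else:
            unknown.append((waveform, UNKNOWN))

    target = len(up)  # equals len(down): original Up count + original Down count
    if len(unknown) >= target:
        unknown = rng.sample(unknown, target)
    elif unknown:
        # Assumption: too few Unknown traces, so draw extras with replacement.
        unknown = unknown + rng.choices(unknown, k=target - len(unknown))

    balanced = up + down + unknown
    rng.shuffle(balanced)
    return balanced


samples = (
    [([0.0, 1.0, 0.5], UP)] * 3
    + [([0.0, -1.0, -0.5], DOWN)] * 1
    + [([0.1, -0.1, 0.1], UNKNOWN)] * 10
)
balanced = balance_by_polarity_inversion(samples)
counts = {c: sum(1 for _, lab in balanced if lab == c) for c in (UP, DOWN, UNKNOWN)}
print(counts)  # {0: 4, 1: 4, 2: 4}
```

With 3 Up and 1 Down traces, inversion gives 4 Up and 4 Down, so 4 Unknown traces are kept and each class ends up with one third of the 12 samples.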

Warning

Dataset size: waveforms.hdf5 ~70 GB, metadata.csv 120 MB

Citation

Chen, Y., Savvaidis, A., Saad, O. M., Huang, G.-C. D., Siervo, D., O’Sullivan, V., McCabe, C., Uku, B., Fleck, P., Burke, G., Alvarez, N. L., Domino, J., & Grigoratos, I. (2024). TXED: The Texas Earthquake Dataset for AI. Seismological Research Letters, 95(6), 1-13. https://doi.org/10.1785/0220230327


DiTing

The DiTing dataset is a large-scale Chinese seismic benchmark dataset specifically designed for artificial intelligence seismology research. This dataset contains over 640,000 high-quality P-wave first-motion polarity labels, covering more than 1,300 broadband and short-period seismic stations across China.

Citation

Zhao, M., Xiao, Z., Chen, S., & Fang, L. (2023). DiTing: A large-scale Chinese seismic benchmark dataset for artificial intelligence in seismology. Earthquake Science, 36(2), 84-94. https://doi.org/10.1016/j.eqs.2022.01.022

Data Download

The dataset can be requested for download at: https://data.earthquake.cn/


Instance

The INSTANCE dataset is an Italian seismic waveform dataset compiled by the Italian National Institute of Geophysics and Volcanology (INGV), specifically designed for machine learning applications. This dataset contains nearly 1.2 million three-component waveform traces and serves as an important resource for seismological research.

Balanced Strategy

Recommended strategy: Polarity Inversion (1:1:1)

This strategy creates a balanced dataset with equal proportions of Up, Down, and Unknown samples:

  • Each Up and Down sample generates two samples (original + polarity-inverted)

  • Unknown samples are added to match the total count of (Up + Down) samples

  • Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3

from seispolarity import BalancedPolarityGenerator

generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"
)

Warning

Dataset size:

  • waveforms (counts) ~160 GB

  • waveforms (ground motion units) ~310 GB

Citation

Michelini, A., Cianetti, S., Gaviano, S., Giunchi, C., Jozinović, D., & Lauciani, V. (2021). INSTANCE – The Italian Seismic Dataset For Machine Learning. Earth System Science Data, 13, 5509–5542. https://doi.org/10.5194/essd-13-5509-2021


PNW

The Pacific Northwest (PNW) dataset is a machine learning-ready curated dataset containing diverse seismic signals from the Pacific Northwest region. This dataset is compiled by the Pacific Northwest Seismic Network and covers various seismic event types including earthquakes, explosions, and noise.

Balanced Strategy

Recommended strategy: Min-Based (1:1:1)

This strategy creates a balanced dataset by sampling equally from all classes up to the minimum class count:

  • Count samples in each polarity class (Up, Down, Unknown)

  • Determine the minimum count among all classes

  • Sample equally from each class up to the minimum count

  • Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3

from seispolarity import BalancedPolarityGenerator

generator = BalancedPolarityGenerator(
    dataset,
    strategy="min_based"
)
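The min-based steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not SeisPolarity's code: `balance_min_based` is a hypothetical helper that works on a plain list of integer labels and returns the indices to keep, sampling without replacement.

```python
import random
from collections import defaultdict


def balance_min_based(labels, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    Hypothetical helper, not part of SeisPolarity.
    """
    rng = random.Random(seed)

    # Step 1: group sample indices by polarity class.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Step 2: the smallest class sets the per-class quota.
    n_min = min(len(indices) for indices in by_class.values())

    # Step 3: draw n_min indices from every class without replacement.
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, n_min))
    return sorted(chosen)


labels = [0] * 50 + [1] * 20 + [2] * 80  # 50 Up, 20 Down, 80 Unknown
subset = balance_min_based(labels)
kept = [labels[i] for i in subset]
print(len(subset), kept.count(0), kept.count(1), kept.count(2))  # 60 20 20 20
```

Unlike polarity inversion, this approach discards data: 90 of the 150 example labels are dropped so that every class matches the 20 Down samples.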

Citation

Ni, Y., Hutko, A., Skene, F., Denolle, M., Malone, S., Bodin, P., Hartog, R., & Wright, A. (2023). Curated Pacific Northwest AI-ready Seismic Dataset. Seismica, 2(1), 368. https://doi.org/10.26443/seismica.v2i1.368

Loading Datasets

Automatic Download

SeisPolarity can automatically download datasets:

from seispolarity import get_dataset_path, WaveformDataset

# Download from Hugging Face (default)
data_path = get_dataset_path("SCSN", "train", cache_dir="./datasets")

# Or use ModelScope (recommended for users in China)
data_path = get_dataset_path("SCSN", "train", use_hf=False)

Load from Local Files

from seispolarity import WaveformDataset

# Disk streaming (suitable for large datasets)
dataset = WaveformDataset(
    path="data/scsn_train.hdf5",
    name="SCSN_Train",
    preload=False
)

# RAM preloading (suitable for small datasets)
dataset = WaveformDataset(
    path="data/scsn_train.hdf5",
    name="SCSN_Train",
    preload=True
)

Dataset API

WaveformDataset

The main class for loading waveform data.

from seispolarity import WaveformDataset

dataset = WaveformDataset(
    path="data.hdf5",          # HDF5 file path
    name="SCSN",               # Dataset name
    preload=False,             # Whether to preload into RAM
    data_key="X",              # HDF5 key for waveforms
    label_key="Y",             # HDF5 key for labels
    p_pick_position=300,       # P-wave arrival position
    pick_key="p_pick",         # Use p_pick as P-wave arrival point
    crop_left=200,             # Samples before P-pick
    crop_right=200,            # Samples after P-pick
    allowed_labels=[0, 1, 2]   # Allowed labels (0: Up, 1: Down, 2: Unknown)
)
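As a rough illustration of how `p_pick_position`, `crop_left`, and `crop_right` might interact, the sketch below cuts a window around the pick from a single trace. The half-open window `[pick - crop_left, pick + crop_right)` is an assumption, and `crop_around_pick` is a hypothetical helper; the library's exact boundary convention is not stated here.

```python
def crop_around_pick(trace, pick, crop_left=200, crop_right=200):
    """Return the samples in [pick - crop_left, pick + crop_right).

    Hypothetical helper, not part of SeisPolarity. Raises ValueError when
    the requested window does not fit inside the trace.
    """
    start, stop = pick - crop_left, pick + crop_right
    if start < 0 or stop > len(trace):
        raise ValueError(
            f"window [{start}, {stop}) exceeds trace of length {len(trace)}"
        )
    return trace[start:stop]


trace = list(range(600))  # stand-in for one 600-sample waveform channel
window = crop_around_pick(trace, pick=300)
print(len(window), window[0], window[-1])  # 400 100 499
```

Under this convention the values shown above yield a 400-sample window, and the P-wave pick always lands at index `crop_left` of the cropped trace.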

Data Format

Waveforms are stored in HDF5 files with the following structure:

waveforms.hdf5
├── X                # Waveform data (N_samples, N_channels)
├── Y                # Polarity labels (N_samples,)
├── Z                # Clarity (only required for ditingmotion)
├── metadata         # Additional metadata (optional)
└── ...

Label Encoding

  • 0: Up (positive polarity)

  • 1: Down (negative polarity)

  • 2: Unknown
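When inspecting raw label arrays outside the library, a small decoding helper can make the integer codes readable. The sketch below uses only the encoding listed above; `LABEL_NAMES` and `label_fractions` are hypothetical names, not SeisPolarity APIs.

```python
from collections import Counter

LABEL_NAMES = {0: "Up", 1: "Down", 2: "Unknown"}


def label_fractions(labels):
    """Map a list of integer label codes to per-class fractions."""
    unexpected = set(labels) - set(LABEL_NAMES)
    if unexpected:
        raise ValueError(f"unexpected label codes: {sorted(unexpected)}")
    if not labels:
        return {name: 0.0 for name in LABEL_NAMES.values()}
    counts = Counter(labels)
    return {name: counts[code] / len(labels) for code, name in LABEL_NAMES.items()}


print(label_fractions([0, 0, 1, 2]))
# {'Up': 0.5, 'Down': 0.25, 'Unknown': 0.25}
```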

DataLoader

Create a PyTorch DataLoader for training:

loader = dataset.get_dataloader(
    batch_size=1024,
    num_workers=4,
    shuffle=True,
    pin_memory=True
)

Data Inspection

Basic Statistics

from seispolarity import WaveformDataset

dataset = WaveformDataset(path="data.hdf5", name="SCSN")

# Get dataset statistics
print(f"Total samples: {len(dataset)}")
print(f"Label distribution: {dataset.label_distribution}")
print(f"Waveform shape: {dataset.waveform_shape}")

Multi-Dataset Training

Combine multiple datasets:

from seispolarity import MultiWaveformDataset, WaveformDataset

# Create multiple datasets
dataset1 = WaveformDataset(path="scsn.hdf5", name="SCSN")
dataset2 = WaveformDataset(path="txed.hdf5", name="Txed")

# Combine them
combined = MultiWaveformDataset([dataset1, dataset2])
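One common way to present several datasets as a single sequence is to translate a global sample index into a (dataset, local index) pair using cumulative lengths. The sketch below shows that general technique; it is an assumption about how such a wrapper could work, not a description of `MultiWaveformDataset` internals, and `ConcatIndex` is a hypothetical class.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatIndex:
    """Map a global index onto (dataset_id, local_index). Hypothetical helper."""

    def __init__(self, lengths):
        # Cumulative end positions, e.g. lengths [5, 3] -> offsets [5, 8].
        self.offsets = list(accumulate(lengths))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def locate(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First dataset whose cumulative end lies strictly beyond `index`.
        dataset_id = bisect_right(self.offsets, index)
        start = self.offsets[dataset_id - 1] if dataset_id else 0
        return dataset_id, index - start


idx = ConcatIndex([5, 3])  # e.g. 5 samples in the first dataset, 3 in the second
print(len(idx), idx.locate(0), idx.locate(4), idx.locate(5), idx.locate(7))
# 8 (0, 0) (0, 4) (1, 0) (1, 2)
```

The binary search keeps each lookup at O(log k) for k datasets, which is negligible next to the cost of reading a waveform from disk.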

Balanced Sampling

For datasets with label imbalance, use balanced sampling:

from seispolarity import BalancedPolarityGenerator

generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"  # or "min_based"
)
loader = generator.get_dataloader(batch_size=256)

Download Locations

Datasets can be downloaded from:

  • Hugging Face: https://huggingface.co/datasets/chuanjun1978/Seismic-AI-Data

  • ModelScope: https://www.modelscope.cn/datasets/chuanjun/Seismic-AI-Data/ (recommended for users in China)

For more details, see the Installation Guide.