Datasets Overview
SeisPolarity provides a unified interface for accessing and processing multiple seismic polarity datasets. All datasets are stored in HDF5 format, which keeps storage compact and allows data to be streamed from disk.
Available Datasets
| Dataset | Source | Size | Classes | Format |
|---|---|---|---|---|
| SCSN | Southern California Seismic Network | Large | U/D/N | HDF5 |
| Txed | Texas Earthquake Dataset | Medium | U/D | HDF5 |
| DiTing | China Earthquake Networks Center | Medium | U/D/N | HDF5 |
| Instance | Italian National Institute of Geophysics and Volcanology (INGV) | Medium | U/D/N | HDF5 |
| PNW | Pacific Northwest | Large | U/D/N | HDF5 |
Datasets
SCSN
The Southern California Seismic Network (SCSN) dataset contains polarity-labeled seismic waveforms for earthquakes recorded between 2000 and 2020. The polarity labels are manual annotations.
Warning
Dataset size: waveforms.hdf5 ~660 GB, metadata.csv ~2.2 GB. The SCSN polarity subset is ~15 GB.
Citation
Cheng, Y., Ross, Z. E., Hauksson, E., Ben-Zion, Y. (2023). Refined earthquake focal mechanism catalog for southern California derived with deep learning algorithms. Journal of Geophysical Research: Solid Earth, 128, e2022JB025975. https://doi.org/10.1029/2022JB025975
Ross, Z. E., Meier, M.-A., Hauksson, E. (2018). P wave arrival picking and first-motion polarity determination with deep learning. Journal of Geophysical Research: Solid Earth, 123, 5405-5416. https://doi.org/10.1029/2018JB015510
Txed
The Texas Earthquake Dataset (TXED) is a regional seismic signal benchmark dataset from Texas. This dataset contains a large number of earthquake events and noise waveforms, serving as an important data resource for machine learning in seismology.
Balanced Strategy
Recommended strategy: Polarity Inversion (1:1:1)
This strategy creates a balanced dataset with equal proportions of Up, Down, and Unknown samples:
Each Up and Down sample generates two samples (original + polarity-inverted)
Unknown samples are added to match the total count of (Up + Down) samples
Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3
```python
from seispolarity import BalancedPolarityGenerator

# `dataset` is an existing WaveformDataset instance
generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"
)
```
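The arithmetic behind the 1:1:1 ratio can be checked with a small NumPy sketch. This is not SeisPolarity's implementation: the function name `polarity_inversion_balance` is hypothetical, the labels follow the encoding 0 = Up, 1 = Down, 2 = Unknown, and the sketch assumes Unknown samples are drawn at random, with replacement only when too few are available.

```python
import numpy as np

UP, DOWN, UNKNOWN = 0, 1, 2  # label encoding used in this documentation


def polarity_inversion_balance(waveforms, labels, seed=0):
    """Hypothetical helper: balance classes by flipping Up/Down traces."""
    rng = np.random.default_rng(seed)
    waveforms = np.asarray(waveforms, dtype=float)
    labels = np.asarray(labels)

    # Up/Down traces: keep the original and add a sign-flipped copy
    # whose label is swapped (Up <-> Down).
    polar = np.flatnonzero((labels == UP) | (labels == DOWN))
    flipped_labels = np.where(labels[polar] == UP, DOWN, UP)

    # Unknown traces: draw as many as there are original Up + Down traces.
    unknown = np.flatnonzero(labels == UNKNOWN)
    if unknown.size == 0:
        raise ValueError("no Unknown samples to draw from")
    # Assumption: fall back to sampling with replacement when too few exist.
    picked = rng.choice(unknown, size=polar.size, replace=unknown.size < polar.size)

    x = np.concatenate([waveforms[polar], -waveforms[polar], waveforms[picked]])
    y = np.concatenate([labels[polar], flipped_labels, labels[picked]])
    return x, y


# Toy data: 3 Up, 1 Down, 6 Unknown traces of 8 points each.
toy_labels = np.array([UP, UP, UP, DOWN] + [UNKNOWN] * 6)
toy_waveforms = np.random.default_rng(1).normal(size=(10, 8))
balanced_x, balanced_y = polarity_inversion_balance(toy_waveforms, toy_labels)
class_counts = np.bincount(balanced_y, minlength=3)
```

With 3 Up and 1 Down traces, inversion yields 4 Up and 4 Down traces, and 4 Unknown traces are drawn to match, so `class_counts` holds 4 per class.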
Warning
Dataset size: waveforms.hdf5 ~70 GB, metadata.csv ~120 MB
Citation
Chen, Y., Savvaidis, A., Saad, O. M., Huang, G.-C. D., Siervo, D., O’Sullivan, V., McCabe, C., Uku, B., Fleck, P., Burke, G., Alvarez, N. L., Domino, J., & Grigoratos, I. (2024). TXED: The Texas Earthquake Dataset for AI. Seismological Research Letters, 95(6), 1-13. https://doi.org/10.1785/0220230327
DiTing
The DiTing dataset is a large-scale Chinese seismic benchmark dataset specifically designed for artificial intelligence seismology research. This dataset contains over 640,000 high-quality P-wave first-motion polarity labels, covering more than 1,300 broadband and short-period seismic stations across China.
Citation
Zhao, M., Xiao, Z., Chen, S., & Fang, L. (2023). DiTing: A large-scale Chinese seismic benchmark dataset for artificial intelligence in seismology. Earthquake Science, 36(2), 84-94. https://doi.org/10.1016/j.eqs.2022.01.022
Data Download
The dataset can be requested for download at: https://data.earthquake.cn/
Instance
The INSTANCE dataset is an Italian seismic waveform dataset compiled by the Italian National Institute of Geophysics and Volcanology (INGV), specifically designed for machine learning applications. This dataset contains nearly 1.2 million three-component waveform traces and serves as an important resource for seismological research.
Balanced Strategy
Recommended strategy: Polarity Inversion (1:1:1)
This strategy creates a balanced dataset with equal proportions of Up, Down, and Unknown samples:
Each Up and Down sample generates two samples (original + polarity-inverted)
Unknown samples are added to match the total count of (Up + Down) samples
Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3
```python
from seispolarity import BalancedPolarityGenerator

# `dataset` is an existing WaveformDataset instance
generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"
)
```
Warning
Dataset size:
waveforms (counts) ~160 GB
waveforms (ground motion units) ~310 GB
Citation
Michelini, A., Cianetti, S., Gaviano, S., Giunchi, C., Jozinović, D., & Lauciani, V. (2021). INSTANCE – The Italian Seismic Dataset For Machine Learning. Earth System Science Data, 13, 5509–5542. https://doi.org/10.5194/essd-13-5509-2021
PNW
The Pacific Northwest (PNW) dataset is a machine learning-ready curated dataset containing diverse seismic signals from the Pacific Northwest region. This dataset is compiled by the Pacific Northwest Seismic Network and covers various seismic event types including earthquakes, explosions, and noise.
Balanced Strategy
Recommended strategy: Min-Based (1:1:1)
This strategy creates a balanced dataset by sampling equally from all classes up to the minimum class count:
Count samples in each polarity class (Up, Down, Unknown)
Determine the minimum count among all classes
Sample equally from each class up to the minimum count
Final distribution: Up = 1/3, Down = 1/3, Unknown = 1/3
```python
from seispolarity import BalancedPolarityGenerator

# `dataset` is an existing WaveformDataset instance
generator = BalancedPolarityGenerator(
    dataset,
    strategy="min_based"
)
```
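The min-based steps can be sketched in a few lines of NumPy. The helper name `min_based_balance` is hypothetical and the code illustrates the idea only, not the library's internals. It assumes sampling without replacement and considers only classes that are actually present in the labels.

```python
import numpy as np


def min_based_balance(labels, seed=0):
    """Hypothetical helper: return indices of a class-balanced subset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Count each class and find the size of the smallest one.
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()

    # Draw n_min indices from every class without replacement.
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=n_min, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(chosen))


# Toy labels: 5 Up (0), 3 Down (1), 9 Unknown (2).
toy_labels = np.array([0] * 5 + [1] * 3 + [2] * 9)
subset = min_based_balance(toy_labels)
subset_counts = np.bincount(toy_labels[subset], minlength=3)
```

Here the smallest class has 3 samples, so the balanced subset keeps 3 samples per class and discards the rest. Unlike polarity inversion, this strategy shrinks the dataset rather than growing it.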
Citation
Ni, Y., Hutko, A., Skene, F., Denolle, M., Malone, S., Bodin, P., Hartog, R., & Wright, A. (2023). Curated Pacific Northwest AI-ready Seismic Dataset. Seismica, 2(1), 368. https://doi.org/10.26443/seismica.v2i1.368
Loading Datasets
Automatic Download
SeisPolarity can automatically download datasets:
```python
from seispolarity import get_dataset_path, WaveformDataset

# Download from Hugging Face (default)
data_path = get_dataset_path("SCSN", "train", cache_dir="./datasets")

# Or use ModelScope (recommended for users in China)
data_path = get_dataset_path("SCSN", "train", use_hf=False)
```
Load from Local Files
```python
from seispolarity import WaveformDataset

# Disk streaming (suitable for large datasets)
dataset = WaveformDataset(
    path="data/scsn_train.hdf5",
    name="SCSN_Train",
    preload=False
)

# RAM preloading (suitable for small datasets)
dataset = WaveformDataset(
    path="data/scsn_train.hdf5",
    name="SCSN_Train",
    preload=True
)
```
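One way to choose between the two modes is to compare the file size with a memory budget. The helper `should_preload` below is hypothetical and not part of SeisPolarity, and the budget values are arbitrary examples.

```python
import os
import tempfile


def should_preload(path, ram_budget_bytes):
    """Hypothetical helper: preload only if the file fits the RAM budget."""
    # Caveat: a compressed HDF5 file expands when read, so the on-disk
    # size is a lower bound on the memory actually needed.
    return os.path.getsize(path) <= ram_budget_bytes


# Demonstration on a throwaway 1 KiB file instead of a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.hdf5")
    with open(demo_path, "wb") as handle:
        handle.write(b"\0" * 1024)
    fits_large_budget = should_preload(demo_path, ram_budget_bytes=4 * 1024**3)
    fits_tiny_budget = should_preload(demo_path, ram_budget_bytes=512)
```

The boolean result could then be passed as the `preload` argument of `WaveformDataset`.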
Dataset API
WaveformDataset
The main class for loading waveform data.
```python
from seispolarity import WaveformDataset

dataset = WaveformDataset(
    path="data.hdf5",          # HDF5 file path
    name="SCSN",               # Dataset name
    preload=False,             # Whether to preload into RAM
    data_key="X",              # HDF5 key for waveforms
    label_key="Y",             # HDF5 key for labels
    p_pick_position=300,       # P-wave arrival position
    pick_key="p_pick",         # Use p_pick as P-wave arrival point
    crop_left=200,             # Samples before P-pick
    crop_right=200,            # Samples after P-pick
    allowed_labels=[0, 1, 2]   # Allowed labels (0: Up, 1: Down, 2: Unknown)
)
```
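The relationship between `p_pick_position`, `crop_left`, and `crop_right` can be illustrated with plain NumPy slicing. The helper `crop_around_pick` is hypothetical, and the half-open window `[pick - crop_left, pick + crop_right)` is an assumption; the exact boundary convention used by SeisPolarity is not stated here.

```python
import numpy as np


def crop_around_pick(trace, pick, crop_left=200, crop_right=200):
    """Hypothetical helper: cut `trace` to [pick - crop_left, pick + crop_right)."""
    trace = np.asarray(trace)
    start, stop = pick - crop_left, pick + crop_right
    if start < 0 or stop > trace.shape[-1]:
        raise ValueError("crop window extends past the trace")
    # Slice the last axis so that 1-D and multi-channel arrays both work.
    return trace[..., start:stop]


# A 600-point trace with the P arrival at index 300.
trace = np.arange(600)
window = crop_around_pick(trace, pick=300)
```

Under this convention the cropped window is `crop_left + crop_right` samples long and the P arrival sits at index `crop_left` of the window.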
Data Format
Waveforms are stored in HDF5 files with the following structure:
```
waveforms.hdf5
├── X          # Waveform data (N_samples, N_channels)
├── Y          # Polarity labels (N_samples,)
├── Z          # Clarity (only required for ditingmotion)
├── metadata   # Additional metadata (optional)
└── ...
```
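A file with this layout can be created and inspected with `h5py`. The sketch below writes a toy file containing only the `X` and `Y` keys. The array sizes and dtypes are illustrative assumptions, and it treats the second axis of `X` as the time axis of single-component traces, which may differ from the real datasets.

```python
import os
import tempfile

import h5py
import numpy as np

n_traces, n_points = 4, 600  # toy sizes, chosen for illustration only

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_waveforms.hdf5")

    # Write a file with the waveform and label keys.
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=np.zeros((n_traces, n_points), dtype="float32"))
        f.create_dataset("Y", data=np.array([0, 1, 2, 0], dtype="int64"))

    # Read back shapes and one label; arrays stay on disk until sliced.
    with h5py.File(path, "r") as f:
        keys = sorted(f.keys())
        x_shape = f["X"].shape
        first_label = int(f["Y"][0])
```

Because `h5py` reads data lazily, checking `shape` or slicing a single element does not load the whole array, which is what makes disk streaming of large files practical.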
Label Encoding
0: Up (positive polarity)
1: Down (negative polarity)
2: Unknown
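When inspecting labels by hand, it can help to translate the integer codes back into names. The helper `describe_labels` is a hypothetical convenience function, not part of the SeisPolarity API.

```python
from collections import Counter

LABEL_NAMES = {0: "Up", 1: "Down", 2: "Unknown"}  # encoding listed above


def describe_labels(labels):
    """Hypothetical helper: count samples per polarity name."""
    return dict(Counter(LABEL_NAMES[int(label)] for label in labels))


label_summary = describe_labels([0, 0, 1, 2, 2, 2])
```

For the six toy labels, `label_summary` maps "Up" to 2, "Down" to 1, and "Unknown" to 3.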
DataLoader
Create a PyTorch DataLoader for training:
```python
loader = dataset.get_dataloader(
    batch_size=1024,
    num_workers=4,
    shuffle=True,
    pin_memory=True
)
```
Data Inspection
Basic Statistics
```python
from seispolarity import WaveformDataset

dataset = WaveformDataset(path="data.hdf5", name="SCSN")

# Get dataset statistics
print(f"Total samples: {len(dataset)}")
print(f"Label distribution: {dataset.label_distribution}")
print(f"Waveform shape: {dataset.waveform_shape}")
```
Multi-Dataset Training
Combine multiple datasets:
```python
from seispolarity import MultiWaveformDataset, WaveformDataset

# Create multiple datasets
dataset1 = WaveformDataset(path="scsn.hdf5", name="SCSN")
dataset2 = WaveformDataset(path="txed.hdf5", name="Txed")

# Combine them
combined = MultiWaveformDataset([dataset1, dataset2])
```
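How `MultiWaveformDataset` merges its inputs is not described here. A common approach for combined datasets in general is to map a global index onto a (dataset, local index) pair using cumulative lengths. The class `ConcatIndex` below is purely illustrative and makes no claim about the library's internals.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatIndex:
    """Illustrative only: view several sequences as one long sequence."""

    def __init__(self, parts):
        self.parts = list(parts)
        # Cumulative lengths, e.g. parts of length 3 and 2 give [3, 5].
        self.offsets = list(accumulate(len(p) for p in self.parts))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first part whose cumulative length exceeds the index.
        part = bisect_right(self.offsets, index)
        start = self.offsets[part - 1] if part else 0
        return self.parts[part][index - start]


combined_view = ConcatIndex([["a0", "a1", "a2"], ["b0", "b1"]])
items = [combined_view[i] for i in range(len(combined_view))]
```

Global indices 0 to 2 resolve to the first sequence and indices 3 to 4 to the second, so iterating the view visits every sample of each part exactly once.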
Balanced Sampling
For datasets with label imbalance, use balanced sampling:
```python
from seispolarity import BalancedPolarityGenerator

generator = BalancedPolarityGenerator(
    dataset,
    strategy="polarity_inversion"  # or "min_based"
)
loader = generator.get_dataloader(batch_size=256)
```
Download Locations
Datasets can be downloaded from:
Hugging Face: https://huggingface.co/datasets/chuanjun1978/Seismic-AI-Data
ModelScope: https://www.modelscope.cn/datasets/chuanjun/Seismic-AI-Data/ (recommended for users in China)
For more details, see the Installation Guide.