Solar activity classification based on Mg II spectra: towards classification on compressed data. (arXiv:2009.07156v1 [astro-ph.SR])
<a href="http://arxiv.org/find/astro-ph/1/au:+Ivanov_S/0/1/0/all/0/1">Sergey Ivanov</a>, <a href="http://arxiv.org/find/astro-ph/1/au:+Tsizh_M/0/1/0/all/0/1">Maksym Tsizh</a>, <a href="http://arxiv.org/find/astro-ph/1/au:+Ullmann_D/0/1/0/all/0/1">Denis Ullmann</a>, <a href="http://arxiv.org/find/astro-ph/1/au:+Panos_B/0/1/0/all/0/1">Brandon Panos</a>, <a href="http://arxiv.org/find/astro-ph/1/au:+Voloshynovskiy_S/0/1/0/all/0/1">Slava Voloshynovskiy</a>

Although large volumes of solar data are available for investigation, the
majority of these data remain unlabeled and are therefore unsuited to modern
supervised machine learning methods. Having a way to automatically classify
spectra into categories related to the degree of solar activity is highly
desirable and will assist and speed up future research efforts in solar
physics. At the same time, the large volume of raw observational data is a
serious bottleneck for machine learning, requiring substantial computational
resources. Additionally, transmitting raw data restricts real-time
observation and requires considerable bandwidth and energy from onboard
solar observation systems. To address these issues, we propose a
framework to classify solar activity on compressed data. To this end, we used a
labeling scheme from a clustering technique in conjunction with several machine
learning algorithms to categorize Mg II spectra measured by NASA’s satellite
IRIS into groups characterizing solar activity. Our training data set is a
human-annotated list of 85 IRIS observations containing 29097 frames in total,
or equivalently 9 million Mg II spectra. The annotated types of solar activity
are active region, pre-flare activity, solar flare, sunspot and quiet Sun. We
used vector quantization to compress these data before training
classifiers. We found that the XGBoost classifier produced the most accurate
results on the compressed data, yielding a prediction rate above 0.95 and
outperforming other ML methods: convolutional neural networks, k-nearest
neighbors, naive Bayes classifiers and support vector machines. A principal
finding of this research is that the classification performance on compressed
and uncompressed data is comparable under our particular architecture, implying
the possibility of large compression rates for relatively low degrees of
information loss.
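
As a rough illustration of the vector quantization step mentioned above, the
following sketch learns a small codebook and replaces each spectrum by the
index of its nearest codebook entry. It is not the authors' pipeline: the
k-means style codebook fitting, the synthetic toy spectra, and all names
(fit_codebook, encode, decode, storage_ratio) are assumptions made for this
example only.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to the vector."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(vector, codebook[k]))


def fit_codebook(vectors, n_codes, n_iter=20, seed=0):
    """Learn a codebook with plain k-means (an assumed choice, not from the paper)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, n_codes)]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_codes)]
        for v in vectors:
            groups[nearest_code(v, codebook)].append(v)
        for k, members in enumerate(groups):
            if members:  # an empty group keeps its previous centroid
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def encode(vectors, codebook):
    """Compress: one integer code per vector."""
    return [nearest_code(v, codebook) for v in vectors]


def decode(codes, codebook):
    """Lossy reconstruction: look each code up in the codebook."""
    return [codebook[c] for c in codes]


def toy_spectrum(center, n_bins, rng):
    """Synthetic stand-in for a spectrum: a Gaussian peak plus small noise."""
    return [math.exp(-((i - center) ** 2) / 8.0) + rng.gauss(0.0, 0.01)
            for i in range(n_bins)]


def storage_ratio(n_vectors, n_bins, n_codes):
    """Crude ratio of stored numbers, counting every value as one unit."""
    raw = n_vectors * n_bins
    compressed = n_vectors + n_codes * n_bins  # codes plus the codebook itself
    return raw / compressed


rng = random.Random(42)
N_BINS = 20
# Two hypothetical families of toy spectra with peaks at different bins.
spectra = [toy_spectrum(c, N_BINS, rng) for c in (5, 14) for _ in range(100)]

codebook = fit_codebook(spectra, n_codes=4)
codes = encode(spectra, codebook)
reconstructed = decode(codes, codebook)

mse = sum(squared_distance(s, r) for s, r in zip(spectra, reconstructed))
mse /= len(spectra) * N_BINS
ratio = storage_ratio(len(spectra), N_BINS, len(codebook))
print("mean squared reconstruction error:", mse)
print("storage ratio (raw / compressed):", ratio)
```

In a setup like the one described, a classifier would then be trained on the
codes or on the reconstructed vectors rather than on the raw spectra; how the
compressed representation is actually fed to the classifiers is not specified
here and would need to be taken from the full paper.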

