Class SortedSSTableWriter
- java.lang.Object
-
- org.apache.cassandra.spark.bulkwriter.SortedSSTableWriter
-
public class SortedSSTableWriter extends java.lang.Object
SSTableWriter that expects sorted data
Note for implementors: the bulk writer always sorts the data in the entire Spark partition before writing. One benefit is that the output SSTables are sorted and non-overlapping. This allows Cassandra to optimize the import of those SSTables, since they can effectively be treated as a single large SSTable. You might want to introduce an SSTableWriter for unsorted data, say UnsortedSSTableWriter, and stop sorting the entire partition (i.e. stop using repartitionAndSortWithinPartitions). Doing so would eliminate the useful property that the output SSTables are globally sorted and non-overlapping. Unless you have a better use case, we should stick with this SortedSSTableWriter.
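To illustrate the non-overlapping property described above, here is a minimal sketch that does not use the bulk writer's API. It assumes a hypothetical helper, rangesOfChunks, that splits a token list into fixed-size chunks (each chunk standing in for one SSTable) and reports the minimum and maximum token of each chunk. With sorted input, consecutive chunk ranges do not overlap; with unsorted input they can.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not part of the bulk writer API.
class TokenRangeSketch {
    // Splits tokens into chunks of chunkSize (each chunk stands in for one SSTable)
    // and returns the {min, max} token of every chunk, in chunk order.
    static List<BigInteger[]> rangesOfChunks(List<BigInteger> tokens, int chunkSize) {
        List<BigInteger[]> ranges = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, tokens.size());
            List<BigInteger> chunk = tokens.subList(start, end);
            ranges.add(new BigInteger[] {Collections.min(chunk), Collections.max(chunk)});
        }
        return ranges;
    }

    // True when some chunk starts at or before the end of the chunk preceding it.
    static boolean hasOverlap(List<BigInteger[]> ranges) {
        for (int i = 1; i < ranges.size(); i++) {
            if (ranges.get(i)[0].compareTo(ranges.get(i - 1)[1]) <= 0) {
                return true;
            }
        }
        return false;
    }

    // Convenience conversion from long literals to BigInteger tokens.
    static List<BigInteger> tokens(long... values) {
        List<BigInteger> result = new ArrayList<>();
        for (long value : values) {
            result.add(BigInteger.valueOf(value));
        }
        return result;
    }

    public static void main(String[] args) {
        List<BigInteger> unsorted = tokens(40, 10, 30, 20, 60, 50);
        List<BigInteger> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);
        System.out.println("unsorted overlaps: " + hasOverlap(rangesOfChunks(unsorted, 2)));
        System.out.println("sorted overlaps: " + hasOverlap(rangesOfChunks(sorted, 2)));
    }
}
```

The chunk size of 2 is arbitrary; real SSTable boundaries depend on the writer's configuration, which this sketch does not model.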
Threading Model:
This class has limited thread-safety guarantees:
- addRow(BigInteger, Map) and close(BulkWriterContext) MUST be called from the same thread (typically the RecordWriter thread). addRow(BigInteger, Map) is NOT synchronized, and these two methods must not be called concurrently.
- prepareSStablesToSend(BulkWriterContext, Set) MAY be called concurrently from background threads (via StreamSession's executor service) and is synchronized to protect shared state.
- close(BulkWriterContext) is synchronized to prevent races with concurrent prepareSStablesToSend(BulkWriterContext, Set) calls.
- Getter methods (rowCount(), bytesWritten(), sstableCount()) may return stale values if called concurrently with prepareSStablesToSend(BulkWriterContext, Set) or close(BulkWriterContext). They are only guaranteed accurate after close(BulkWriterContext) completes.
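The threading rules above can be sketched with a small stand-in class. This is not the real SortedSSTableWriter: ThreadingModelSketch, its prepare method, and its counters are hypothetical names chosen only to mirror the documented pattern (an unsynchronized addRow on one writer thread, a synchronized prepare step on background threads, a synchronized close, and getters read only after close).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in that mirrors the documented threading rules; not the real class.
class ThreadingModelSketch {
    private long rowCount = 0;      // written only by the single writer thread
    private int sstableCount = 0;   // shared state, guarded by "this"
    private boolean closed = false; // shared state, guarded by "this"

    // Not synchronized: call only from the writer thread.
    void addRow(String row) {
        rowCount++;
    }

    // Synchronized: safe to call from background threads.
    synchronized int prepare(int producedSSTables) {
        if (closed) {
            return 0; // nothing to do once the writer is closed
        }
        sstableCount += producedSSTables;
        return producedSSTables;
    }

    // Synchronized so it cannot interleave with prepare(); call from the writer thread.
    synchronized void close() {
        closed = true;
    }

    // Unsynchronized getter: only trustworthy after close() has completed.
    int sstableCount() {
        return sstableCount;
    }

    // Runs the writer on the calling thread and prepare() on a small thread pool.
    static int demo() {
        ThreadingModelSketch writer = new ThreadingModelSketch();
        ExecutorService background = Executors.newFixedThreadPool(2);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                writer.addRow("row-" + i);
                Callable<Integer> task = () -> writer.prepare(1);
                pending.add(background.submit(task));
            }
            for (Future<Integer> future : pending) {
                future.get(); // wait for background work before closing
            }
            writer.close();
            return writer.sstableCount();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            background.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sstables prepared: " + demo());
    }
}
```

Waiting on every Future before calling close() is a choice made for this sketch so that the final count is deterministic; the real class instead tolerates late calls by returning an empty map once closed.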
-
-
Field Summary
Fields (Modifier and Type, Field):
- static java.lang.String CASSANDRA_VERSION_PREFIX
-
Constructor Summary
Constructors:
- SortedSSTableWriter(org.apache.cassandra.bridge.SSTableWriter tableWriter, java.nio.file.Path outDir, DigestAlgorithm digestAlgorithm, int partitionId)
- SortedSSTableWriter(BulkWriterContext writerContext, java.nio.file.Path outDir, DigestAlgorithm digestAlgorithm, int partitionId)
-
Method Summary
Methods (Modifier and Type, Method: Description):
- void addRow(java.math.BigInteger token, java.util.Map<java.lang.String,java.lang.Object> boundValues): Add a row to be written.
- long bytesWritten()
- void close(BulkWriterContext writerContext): Closes this writer, flushes any remaining data, calculates digests, and validates all SSTables.
- java.util.Map<java.nio.file.Path,Digest> fileDigestMap()
- java.nio.file.Path getOutDir()
- java.lang.String getPackageVersion(java.lang.String lowestCassandraVersion)
- com.google.common.collect.Range<java.math.BigInteger> getTokenRange()
- java.util.Map<java.nio.file.Path,Digest> prepareSStablesToSend(BulkWriterContext writerContext, java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor> sstables): Prepares a set of SSTables to be sent to replicas by calculating digests and validating them.
- long rowCount()
- void setSSTablesProducedListener(java.util.function.Consumer<java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor>> listener)
- int sstableCount()
- void validateSSTables(BulkWriterContext writerContext)
- void validateSSTables(BulkWriterContext writerContext, java.nio.file.Path outputDirectory, java.util.Set<java.nio.file.Path> dataFilePaths): Validate SSTables.
-
-
-
Field Detail
-
CASSANDRA_VERSION_PREFIX
public static final java.lang.String CASSANDRA_VERSION_PREFIX
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
SortedSSTableWriter
public SortedSSTableWriter(org.apache.cassandra.bridge.SSTableWriter tableWriter, java.nio.file.Path outDir, DigestAlgorithm digestAlgorithm, int partitionId)
-
SortedSSTableWriter
public SortedSSTableWriter(BulkWriterContext writerContext, java.nio.file.Path outDir, DigestAlgorithm digestAlgorithm, int partitionId)
-
-
Method Detail
-
getPackageVersion
@NotNull public java.lang.String getPackageVersion(java.lang.String lowestCassandraVersion)
-
addRow
public void addRow(java.math.BigInteger token, java.util.Map<java.lang.String,java.lang.Object> boundValues) throws java.io.IOException
Add a row to be written.
Threading: This method MUST be called from the same thread that calls close(BulkWriterContext) (typically the RecordWriter thread). It is NOT thread-safe and must not be called concurrently with any other method on this instance.
- Parameters:
- token - the hashed token of the row's partition key. The value must be monotonically increasing across subsequent calls.
- boundValues - bound values of the columns in the row
- Throws:
- java.io.IOException - I/O exception when adding the row
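The ordering requirement on token can be checked on the caller's side before rows are handed to the writer. The sketch below is a hypothetical guard, TokenOrderGuard, that is not part of the bulk writer API. It accepts equal consecutive tokens on the assumption that several rows of one partition share a token; whether the real writer accepts that is not stated here.

```java
import java.math.BigInteger;

// Hypothetical guard; not part of the bulk writer API.
class TokenOrderGuard {
    private BigInteger lastToken = null;

    // Throws if the token is smaller than the previous one. Equal tokens are
    // accepted, assuming rows of the same partition share a token.
    void check(BigInteger token) {
        if (lastToken != null && token.compareTo(lastToken) < 0) {
            throw new IllegalArgumentException(
                    "Token " + token + " is smaller than previous token " + lastToken);
        }
        lastToken = token;
    }

    // Convenience for the demo: true when every token passes check() in order.
    static boolean isOrdered(long... tokens) {
        TokenOrderGuard guard = new TokenOrderGuard();
        try {
            for (long token : tokens) {
                guard.check(BigInteger.valueOf(token));
            }
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ordered input accepted: " + isOrdered(-5, 0, 0, 7));
        System.out.println("unordered input accepted: " + isOrdered(1, 3, 2));
    }
}
```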
-
setSSTablesProducedListener
public void setSSTablesProducedListener(java.util.function.Consumer<java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor>> listener)
-
rowCount
public long rowCount()
- Returns:
- the total number of rows written
-
bytesWritten
public long bytesWritten()
- Returns:
- the total number of bytes written
-
sstableCount
public int sstableCount()
- Returns:
- the total number of sstables written
-
prepareSStablesToSend
public java.util.Map<java.nio.file.Path,Digest> prepareSStablesToSend(@NotNull BulkWriterContext writerContext, java.util.Set<org.apache.cassandra.bridge.SSTableDescriptor> sstables) throws java.io.IOException
Prepares a set of SSTables to be sent to replicas by calculating digests and validating them.
This method is called when SSTables are produced during the write process (before final close). It processes newly produced SSTables, calculates their file digests, validates them, and updates the internal counters.
Threading: This method is thread-safe and may be called concurrently from background threads (e.g., from DirectStreamSession.onSSTablesProduced(Set) via the executor service). It is synchronized to protect shared state (overallFileDigests, sstableCount, bytesWritten) from concurrent access with close(BulkWriterContext).
- Parameters:
- writerContext - the bulk writer context
- sstables - the set of SSTable descriptors to prepare
- Returns:
- a map of file paths to their digests, or an empty map if the writer is already closed
- Throws:
- java.io.IOException - if an I/O error occurs
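As a rough illustration of the digest step and the empty-map-after-close behaviour described for this method, the sketch below computes one digest per file and accumulates them in a shared map. It makes several assumptions: DigestSketch and its methods are hypothetical, a hex String stands in for the library's Digest type, and MD5 from the JDK stands in for whatever DigestAlgorithm is configured. Validation is omitted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in; a hex String replaces the library's Digest type.
class DigestSketch {
    private final Map<Path, String> overallDigests = new HashMap<>(); // guarded by "this"
    private boolean closed = false;                                    // guarded by "this"

    // Computes a digest per file and records it; returns an empty map once closed.
    synchronized Map<Path, String> prepare(Set<Path> files) throws IOException {
        if (closed) {
            return Collections.emptyMap();
        }
        Map<Path, String> digests = new HashMap<>();
        for (Path file : files) {
            digests.put(file, md5Hex(Files.readAllBytes(file)));
        }
        overallDigests.putAll(digests);
        return digests;
    }

    synchronized void close() {
        closed = true;
    }

    // MD5 is used only because every JDK ships it; it is an assumption of this sketch.
    static String md5Hex(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5").digest(data);
            return String.format("%032x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes a temporary file, prepares it, closes, then prepares again.
    static boolean demo() {
        try {
            Path file = Files.createTempFile("sketch", "-Data.db");
            Files.write(file, "abc".getBytes(StandardCharsets.UTF_8));
            DigestSketch sketch = new DigestSketch();
            Map<Path, String> first = sketch.prepare(Collections.singleton(file));
            sketch.close();
            Map<Path, String> afterClose = sketch.prepare(Collections.singleton(file));
            Files.delete(file);
            return "900150983cd24fb0d6963f7d28e17f72".equals(first.get(file))
                    && afterClose.isEmpty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("digest recorded, empty after close: " + demo());
    }
}
```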
-
close
public void close(BulkWriterContext writerContext) throws java.io.IOException
Closes this writer, flushes any remaining data, calculates digests, and validates all SSTables.
This method performs the final flush of the SSTable writer, processes any SSTables that were not already handled by prepareSStablesToSend(BulkWriterContext, Set), calculates their digests, and validates all written SSTables.
Threading: This method MUST be called from the same thread that calls addRow(BigInteger, Map) (typically the RecordWriter thread). It is synchronized to prevent races with concurrent prepareSStablesToSend(BulkWriterContext, Set) calls from background threads.
This method is idempotent: calling it multiple times will return early after the first call completes.
- Parameters:
- writerContext - the bulk writer context
- Throws:
- java.io.IOException - if an I/O error occurs during closing
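The idempotent behaviour described for close can be sketched as a synchronized method guarded by a flag. IdempotentCloseSketch and its finalFlushes counter are hypothetical and only count how often the closing work would run; they do not reproduce what the real method does internally.

```java
// Hypothetical stand-in that shows only the "return early when already closed" pattern.
class IdempotentCloseSketch {
    private boolean closed = false; // guarded by "this"
    private int finalFlushes = 0;   // counts how often the closing work actually ran

    synchronized void close() {
        if (closed) {
            return; // a previous call already completed
        }
        finalFlushes++; // stands in for the final flush, digest calculation and validation
        closed = true;  // set last, so the flag is only true once the work has finished
    }

    synchronized int finalFlushes() {
        return finalFlushes;
    }

    public static void main(String[] args) {
        IdempotentCloseSketch writer = new IdempotentCloseSketch();
        writer.close();
        writer.close();
        writer.close();
        System.out.println("closing work ran " + writer.finalFlushes() + " time(s)");
    }
}
```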
-
validateSSTables
public void validateSSTables(@NotNull BulkWriterContext writerContext)
-
validateSSTables
public void validateSSTables(@NotNull BulkWriterContext writerContext, @NotNull java.nio.file.Path outputDirectory, @Nullable java.util.Set<java.nio.file.Path> dataFilePaths)
Validate SSTables. If dataFilePaths is null, it finds all SSTables under the output directory of the writer and validates them.
- Parameters:
- writerContext - bulk writer context
- outputDirectory - output directory of the SSTable writer
- dataFilePaths - paths of the SSTables (data files) to be validated. The argument is nullable. When it is null, all SSTables under the output directory are validated.
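The null fallback for dataFilePaths can be pictured as follows. This is a sketch under stated assumptions, not the method's actual lookup: DataFileLookupSketch is hypothetical, and the glob "*-Data.db" and the example file names are assumptions about how data files are named in the output directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; the "*-Data.db" glob is an assumption of this sketch.
class DataFileLookupSketch {
    // Returns the given paths, or every "*-Data.db" file directly under
    // outputDirectory when dataFilePaths is null.
    static Set<Path> resolve(Path outputDirectory, Set<Path> dataFilePaths) {
        if (dataFilePaths != null) {
            return dataFilePaths;
        }
        Set<Path> found = new TreeSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDirectory, "*-Data.db")) {
            for (Path path : stream) {
                found.add(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    // Creates a temporary directory with two data files and one index file,
    // then counts what the null fallback finds.
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("sstables");
            Files.createFile(dir.resolve("nb-1-big-Data.db"));
            Files.createFile(dir.resolve("nb-1-big-Index.db"));
            Files.createFile(dir.resolve("nb-2-big-Data.db"));
            return resolve(dir, null).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("data files found: " + demo());
    }
}
```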
-
getTokenRange
public com.google.common.collect.Range<java.math.BigInteger> getTokenRange()
-
getOutDir
public java.nio.file.Path getOutDir()
-
fileDigestMap
public java.util.Map<java.nio.file.Path,Digest> fileDigestMap()
- Returns:
- a view of the file digest map
-
-