Package org.openrefine.runners.local.pll
Class PairPLL<K,V>
- java.lang.Object
  - org.openrefine.runners.local.pll.PLL<Tuple2<K,V>>
    - org.openrefine.runners.local.pll.PairPLL<K,V>
- Type Parameters:
K- the type of keys in the collection
V- the type of values in the collection
public class PairPLL<K,V> extends PLL<Tuple2<K,V>>
Adds additional methods for PLLs of keyed collections. The supplied Partitioner determines in which partition an element should be found, depending on its key.
- Author:
- Antonin Delpeuch
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.openrefine.runners.local.pll.PLL
PLL.LastFlush, PLL.PLLExecutionError
-
-
Field Summary
Fields
Modifier and Type, Field:
protected Optional<Partitioner<K>> partitioner
protected PLL<Tuple2<K,V>> pll
-
Fields inherited from class org.openrefine.runners.local.pll.PLL
cachedPartitions, context, id, name
-
-
Method Summary
Modifier and Type, Method, Description:
static <T> PairPLL<Long,T> assumeIndexed(PairPLL<Long,T> pairPLL, long totalRowCount)
    Assuming that the keys of the PairPLL are indices, deduce the partition sizes from the first element of each partition and the total number of elements, creating an appropriate RangePartitioner and adding it to the PLL.
static <T> PairPLL<Long,T> assumeSorted(PairPLL<Long,T> pairPLL)
    Assumes that a PLL is sorted by key and derives the appropriate partitioner for it.
ProgressingFuture<Void> cacheAsync()
    Prevents double caching from this PairPLL wrapper and the underlying PLL by deferring the caching to the child.
protected CloseableIterator<Tuple2<K,V>> compute(Partition partition)
    Iterate over the elements of the given partition.
io.vavr.collection.Array<Long> computePartitionSizes()
PairPLL<K,V> dropFirstElements(long n)
    Drops the first n elements at the beginning of the collection.
PairPLL<K,V> dropLastElements(long n)
    Drops the last n elements at the end of the collection.
PairPLL<K,V> filter(Predicate<? super Tuple2<K,V>> predicate)
    Derives a new PLL by filtering the collection to only contain elements which match the supplied predicate.
<W> PairPLL<K,Tuple2<V,W>> fullJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
    Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the full (outer) join of both PLLs.
protected static <K,V> List<Tuple2<K,V>> gatherElementsBefore(K upperBound, int limit, CloseableIterator<Tuple2<K,V>> stream, Comparator<K> comparator)
    Returns the last n elements whose key is strictly less than the supplied upper bound.
io.vavr.collection.Array<V> get(K key)
    Returns the list of elements of the PLL indexed by the given key.
io.vavr.collection.Array<Tuple2<K,V>> getByKeys(Set<K> keys)
    Returns the list of elements whose keys match one of the supplied keys.
List<PLL<?>> getParents()
    Returns directly the parents of the underlying PLL of this PairPLL, so that the PairPLL wrapper is transparent in query trees.
Optional<Partitioner<K>> getPartitioner()
io.vavr.collection.Array<? extends Partition> getPartitions()
QueryTree getQueryTree()
List<Tuple2<K,V>> getRangeAfter(K from, int limit, Comparator<K> comparator)
    Returns the list of elements starting at the given key and for up to the given number of elements.
List<Tuple2<K,V>> getRangeBefore(K upperBound, int limit, Comparator<K> comparator)
    Returns the list of elements ending at the given key (excluded) and for up to the given number of elements.
boolean hasCachedPartitionSizes()
    Is this PLL aware of the size of its partitions?
<W> PairPLL<K,Tuple2<V,W>> innerJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
    Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the inner join of both PLLs.
boolean isCached()
    Prevents double caching from this PairPLL wrapper and the underlying PLL by deferring the caching to the child.
CloseableIterator<Tuple2<K,V>> iterate(Partition partition)
    Iterate over the elements of the given partition.
PLL<K> keys()
    Returns a PLL of the keys contained in this collection, in the same order.
<W> PairPLL<K,Tuple2<V,W>> leftJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
    Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the left join of both PLLs.
PairPLL<K,V> limitPartitions(long limit)
    Limit each partition to contain only its first N elements.
<W> PairPLL<K,W> mapValues(BiFunction<K,V,W> mapFunction, String mapDescription)
    Returns a PLL obtained by mapping each element and preserving the indexing.
PairPLL<K,V> retainPartitions(List<Integer> partitionIndices)
    Only retain partitions designated by the given list of indices.
<W> PairPLL<K,Tuple2<V,W>> rightJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
    Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the right join of both PLLs.
CloseableIterator<Tuple2<K,V>> streamBetweenKeys(Optional<K> from, Optional<K> upTo, Comparator<K> comparator)
    Iterates over the elements of this PLL between the given keys, where both boundaries are optional.
CloseableIterator<Tuple2<K,V>> streamBetweenKeys(K from, K upTo, Comparator<K> comparator)
    Iterates over the elements of this PLL from the given key (inclusive) and up to the other given key (exclusive).
CloseableIterator<Tuple2<K,V>> streamFromKey(K from, Comparator<K> comparator)
    Iterates over the elements of this PLL starting from a given key (inclusive).
CloseableIterator<Tuple2<K,V>> streamUpToKey(K upTo, Comparator<K> comparator)
    Iterates over the elements of this PLL up to the given key (exclusive).
PLL<Tuple2<K,V>> toPLL()
    Returns the underlying PLL, discarding any partitioner.
String toString()
void uncache()
    Unloads the partition contents from memory.
PLL<V> values()
    Returns a PLL of the values contained in this collection, in the same order.
PairPLL<K,V> withCachedPartitionSizes(io.vavr.collection.Array<Long> newCachedPartitionSizes)
    Returns a copy of this PairPLL with the given partition sizes, when they are externally known.
PairPLL<K,V> withPartitioner(Optional<Partitioner<K>> partitioner)
    Returns a copy of this PairPLL with a changed partitioner.
-
Methods inherited from class org.openrefine.runners.local.pll.PLL
aggregate, batchPartitions, collect, collectPartitionsAsync, concatenate, concatenate, count, flatMap, getContext, getId, getPartitionSizes, isEmpty, iterateFromPartition, iterator, map, mapPartitions, mapToPair, mapToPair, numPartitions, runOnPartitions, runOnPartitions, runOnPartitionsAsync, runOnPartitionsAsync, runOnPartitionsWithoutInterruption, runOnPartitionsWithoutInterruption, saveAsTextFile, saveAsTextFileAsync, scanMap, scanMapStream, sort, take, writeOriginalPartition, writePartition, writePlannedPartition, zipWithIndex
-
-
-
-
Method Detail
-
getPartitioner
public Optional<Partitioner<K>> getPartitioner()
- Returns:
- the partitioner used to locate elements by key
-
computePartitionSizes
public io.vavr.collection.Array<Long> computePartitionSizes()
- Overrides:
computePartitionSizes in class PLL<Tuple2<K,V>>
-
hasCachedPartitionSizes
public boolean hasCachedPartitionSizes()
Description copied from class: PLL
Is this PLL aware of the size of its partitions?
- Overrides:
hasCachedPartitionSizes in class PLL<Tuple2<K,V>>
-
keys
public PLL<K> keys()
Returns a PLL of the keys contained in this collection, in the same order.
-
values
public PLL<V> values()
Returns a PLL of the values contained in this collection, in the same order.
-
mapValues
public <W> PairPLL<K,W> mapValues(BiFunction<K,V,W> mapFunction, String mapDescription)
Returns a PLL obtained by mapping each element and preserving the indexing. If a partitioner is set on this PLL, it will also be used by the returned PLL.
- Type Parameters:
W- the type of the mapped values
- Parameters:
mapFunction- the function to apply on each element
mapDescription- a short description of the function, for debugging purposes
-
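The key-preserving mapping described above can be sketched on plain Java collections. This is only an illustration, not the PairPLL implementation: the class `MapValuesSketch` is hypothetical and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class MapValuesSketch {
    /**
     * Applies mapFunction to each (key, value) pair and keeps the key unchanged,
     * so the order and the indexing of the input are preserved.
     */
    static <K, V, W> List<Map.Entry<K, W>> mapValues(
            List<Map.Entry<K, V>> pairs, BiFunction<K, V, W> mapFunction) {
        List<Map.Entry<K, W>> mapped = new ArrayList<>();
        for (Map.Entry<K, V> pair : pairs) {
            W newValue = mapFunction.apply(pair.getKey(), pair.getValue());
            mapped.add(Map.entry(pair.getKey(), newValue));
        }
        return mapped;
    }
}
```

Because the keys are untouched, a partitioner that located elements by key before the mapping remains valid afterwards.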
get
public io.vavr.collection.Array<V> get(K key)
Returns the list of elements of the PLL indexed by the given key. This operation will be more efficient if a partitioner is available, making it possible to scan the relevant partition only. Furthermore, if the PLL is known to be sorted (which happens when it bears a RangePartitioner), then we only scan the relevant partition up to the given key and do not scan the greater keys.
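The early stop on sorted data can be sketched as follows. This is a minimal illustration on a single in-memory partition, assuming keys are Long and sorted in increasing order; `SortedPartitionLookup` is a hypothetical name and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SortedPartitionLookup {
    /** Collects the values stored under key, stopping at the first greater key. */
    static <V> List<V> lookup(List<Map.Entry<Long, V>> sortedPartition, long key) {
        List<V> found = new ArrayList<>();
        for (Map.Entry<Long, V> entry : sortedPartition) {
            if (entry.getKey() > key) {
                break; // sorted input: no later element can match
            }
            if (entry.getKey() == key) {
                found.add(entry.getValue());
            }
        }
        return found;
    }
}
```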
-
getRangeAfter
public List<Tuple2<K,V>> getRangeAfter(K from, int limit, Comparator<K> comparator)
Returns the list of elements starting at the given key and for up to the given number of elements. This assumes that the PLL is sorted by keys.
- Parameters:
from- the first key to return
limit- the maximum number of elements to return
comparator- the ordering on K that is assumed on the PLL
-
getRangeBefore
public List<Tuple2<K,V>> getRangeBefore(K upperBound, int limit, Comparator<K> comparator)
Returns the list of elements ending at the given key (excluded) and for up to the given number of elements. This assumes that the PLL is sorted by keys.
- Parameters:
upperBound- the least key not to return
limit- the maximum number of elements to return
comparator- the ordering on K that is assumed on the PLL
- TODO: this could be optimized further. When the partitions are cached in memory, we can iterate from them in reverse.
-
gatherElementsBefore
protected static <K,V> List<Tuple2<K,V>> gatherElementsBefore(K upperBound, int limit, CloseableIterator<Tuple2<K,V>> stream, Comparator<K> comparator)
Returns the last n elements (where n is the supplied limit) whose key is strictly less than the supplied upper bound.
- Parameters:
stream- the stream to take the elements from, which is assumed to be in increasing order
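One way to gather the last elements before a bound from a forward-only, increasing stream is a bounded sliding window. The sketch below assumes this approach for illustration; it is not necessarily how gatherElementsBefore is implemented. `LastElementsBefore` is a hypothetical name, and `Iterator` and `Map.Entry` stand in for `CloseableIterator` and `Tuple2`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class LastElementsBefore {
    /** Keeps at most limit elements: the last ones whose key is below upperBound. */
    static <K, V> List<Map.Entry<K, V>> gather(K upperBound, int limit,
            Iterator<Map.Entry<K, V>> increasing, Comparator<K> comparator) {
        Deque<Map.Entry<K, V>> window = new ArrayDeque<>();
        while (limit > 0 && increasing.hasNext()) {
            Map.Entry<K, V> element = increasing.next();
            if (comparator.compare(element.getKey(), upperBound) >= 0) {
                break; // increasing order: every later key is also too large
            }
            if (window.size() == limit) {
                window.removeFirst(); // evict the oldest element to make room
            }
            window.addLast(element);
        }
        return new ArrayList<>(window);
    }
}
```

The window never holds more than limit elements, so memory use does not depend on the length of the stream.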
-
getByKeys
public io.vavr.collection.Array<Tuple2<K,V>> getByKeys(Set<K> keys)
Returns the list of elements whose keys match one of the supplied keys.
- Parameters:
keys- the keys to look up
- Returns:
- the list of elements in the order they appear in the PLL
-
streamFromKey
public CloseableIterator<Tuple2<K,V>> streamFromKey(K from, Comparator<K> comparator)
Iterates over the elements of this PLL starting from a given key (inclusive). This assumes that the PLL is sorted.
- Parameters:
from- the first key to start iterating from
comparator- the order used to compare the keys
- Returns:
- a stream which starts on the first element whose key is greater than or equal to the provided one
-
streamUpToKey
public CloseableIterator<Tuple2<K,V>> streamUpToKey(K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL up to the given key (exclusive). This assumes that the PLL is sorted.
- Parameters:
upTo- the key to stop iterating at
comparator- the order used to compare the keys
-
streamBetweenKeys
public CloseableIterator<Tuple2<K,V>> streamBetweenKeys(K from, K upTo, Comparator<K> comparator)
Iterates over the elements of this PLL from the given key (inclusive) and up to the other given key (exclusive). This assumes that the PLL is sorted.
- Parameters:
from- the key to start iterating from
upTo- the key to stop iterating at
comparator- the order used to compare the keys
-
streamBetweenKeys
public CloseableIterator<Tuple2<K,V>> streamBetweenKeys(Optional<K> from, Optional<K> upTo, Comparator<K> comparator)
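Iterates over the elements of this PLL between the given keys, where both boundaries are optional. This assumes that the PLL is sorted.
- Parameters:
from- the optional key to start iterating from
upTo- the optional key to stop iterating at
comparator- the order used to compare the keys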
-
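The range semantics can be sketched on a sorted in-memory list of keys. The sketch assumes that the optional-bound variant follows the same convention as the two-key variant (lower bound inclusive, upper bound exclusive), which the documentation above does not state explicitly. `KeyRangeSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class KeyRangeSketch {
    /** Returns the keys k with from <= k < upTo; an absent bound is unbounded. */
    static <K> List<K> between(List<K> sortedKeys, Optional<K> from,
            Optional<K> upTo, Comparator<K> comparator) {
        List<K> result = new ArrayList<>();
        for (K key : sortedKeys) {
            if (upTo.isPresent() && comparator.compare(key, upTo.get()) >= 0) {
                break; // sorted input: nothing after this key is in range
            }
            if (!from.isPresent() || comparator.compare(key, from.get()) >= 0) {
                result.add(key);
            }
        }
        return result;
    }
}
```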
filter
public PairPLL<K,V> filter(Predicate<? super Tuple2<K,V>> predicate)
Description copied from class: PLL
Derives a new PLL by filtering the collection to only contain elements which match the supplied predicate. The predicate is evaluated lazily, so it can be called multiple times on the same element, depending on the actions called on the returned PLL.
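The consequence of lazy evaluation can be illustrated with a plain Java view that re-runs the predicate on every traversal. This is a sketch of the general behaviour only, not of PLL internals; `LazyFilter` is a hypothetical name.

```java
import java.util.List;
import java.util.function.Predicate;

class LazyFilter {
    /**
     * Returns a view that filters on demand: nothing is evaluated when the
     * view is created, and each new iteration evaluates the predicate again.
     */
    static <T> Iterable<T> filter(List<T> source, Predicate<? super T> predicate) {
        return () -> source.stream().filter(predicate).iterator();
    }
}
```

Iterating twice over the returned view calls the predicate twice per source element, which is why a predicate passed to filter should be cheap and free of side effects unless the result is cached.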
-
limitPartitions
public PairPLL<K,V> limitPartitions(long limit)
Description copied from class: PLL
Limit each partition to contain only its first N elements.
- Overrides:
limitPartitions in class PLL<Tuple2<K,V>>
- Parameters:
limit- the maximum number of items per partition
-
retainPartitions
public PairPLL<K,V> retainPartitions(List<Integer> partitionIndices)
Description copied from class: PLL
Only retain partitions designated by the given list of indices. The other partitions are dropped from the PLL. The partitions in the resulting PLL are ordered according to the list of indices supplied.
- Overrides:
retainPartitions in class PLL<Tuple2<K,V>>
- Parameters:
partitionIndices- the indices of the partitions to retain
-
dropFirstElements
public PairPLL<K,V> dropFirstElements(long n)
Drops the first n elements at the beginning of the collection. This also adapts any partitioner to work on the cropped collection.
- Overrides:
dropFirstElements in class PLL<Tuple2<K,V>>
- Parameters:
n- the number of elements to remove
-
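Part of adapting to a cropped collection is recomputing how many elements each partition holds once the first n are gone. The sketch below shows that arithmetic only, under the assumption that emptied partitions are kept with size zero; whether PairPLL keeps or removes them is not stated here. `DropFirstSizes` is a hypothetical name.

```java
class DropFirstSizes {
    /** Returns the partition sizes left after removing the first n elements overall. */
    static long[] afterDroppingFirst(long[] sizes, long n) {
        long[] result = new long[sizes.length];
        long remaining = n;
        for (int i = 0; i < sizes.length; i++) {
            long dropped = Math.min(remaining, sizes[i]); // take from the earliest partitions first
            result[i] = sizes[i] - dropped;
            remaining -= dropped;
        }
        return result;
    }
}
```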
dropLastElements
public PairPLL<K,V> dropLastElements(long n)
Drops the last n elements at the end of the collection. This also adapts any partitioner to work on the cropped collection.
- Overrides:
dropLastElements in class PLL<Tuple2<K,V>>
- Parameters:
n- the number of elements to remove
-
compute
protected CloseableIterator<Tuple2<K,V>> compute(Partition partition)
Description copied from class: PLL
Iterate over the elements of the given partition. This is the method that should be implemented by subclasses. As this method forces computation, ignoring any caching, consumers should not call it directly but rather use PLL.iterate(Partition). Once the iterator is no longer needed, it should be closed. This makes it possible to release the underlying resources supporting it, such as open files or sockets.
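Closing an iterator once it is no longer needed is conveniently done with try-with-resources. The sketch below uses a hypothetical `TrackedIterator` that implements `AutoCloseable`; it is a stand-in and not the library's CloseableIterator, whose exact interface is not shown on this page.

```java
import java.util.Iterator;

class ClosingIteratorSketch {
    /** Hypothetical closeable iterator that records whether it was closed. */
    static class TrackedIterator<T> implements Iterator<T>, AutoCloseable {
        private final Iterator<T> delegate;
        boolean closed = false;

        TrackedIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public T next() {
            return delegate.next();
        }

        @Override
        public void close() {
            closed = true; // a real iterator would release files or sockets here
        }
    }

    /** Consumes the iterator and closes it, even if iteration throws. */
    static <T> int countAndClose(TrackedIterator<T> iterator) {
        int count = 0;
        try (TrackedIterator<T> it = iterator) {
            while (it.hasNext()) {
                it.next();
                count++;
            }
        }
        return count;
    }
}
```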
-
iterate
public CloseableIterator<Tuple2<K,V>> iterate(Partition partition)
Description copied from class: PLL
Iterate over the elements of the given partition. If the contents of this PLL have been cached, this will iterate from the cache instead.
-
getPartitions
public io.vavr.collection.Array<? extends Partition> getPartitions()
- Specified by:
getPartitions in class PLL<Tuple2<K,V>>
- Returns:
- the partitions in this list
-
assumeIndexed
public static <T> PairPLL<Long,T> assumeIndexed(PairPLL<Long,T> pairPLL, long totalRowCount)
Assuming that the keys of the PairPLL are indices, deduce the partition sizes from the first element of each partition and the total number of elements, creating an appropriate RangePartitioner and adding it to the PLL.
If the total row count is unknown (negative) then partition sizes are not inferred.
Note: this method is static for type-checking purposes (it can only apply to a PairPLL keyed by Long).
- Parameters:
pairPLL- the PairPLL whose keys are assumed to be indices
totalRowCount- the known row count
- Returns:
- the new PLL with the derived partitioner
-
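The size deduction described for assumeIndexed can be sketched with simple arithmetic: each partition ends where the next one starts, and the last one ends at the total row count. The sketch assumes contiguous indices and non-empty partitions, and leaves out the case of an unknown (negative) row count; `IndexedPartitionSizes` is a hypothetical name, not part of the library.

```java
class IndexedPartitionSizes {
    /**
     * firstIndices[i] is the index of the first element of partition i.
     * The size of a partition is the gap to the next partition's first index,
     * or to totalRowCount for the last partition.
     */
    static long[] sizes(long[] firstIndices, long totalRowCount) {
        long[] sizes = new long[firstIndices.length];
        for (int i = 0; i < firstIndices.length; i++) {
            long end = (i + 1 < firstIndices.length) ? firstIndices[i + 1] : totalRowCount;
            sizes[i] = end - firstIndices[i];
        }
        return sizes;
    }
}
```

For example, first indices 0, 4 and 9 with 12 rows in total give partition sizes 4, 5 and 3.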
assumeSorted
public static <T> PairPLL<Long,T> assumeSorted(PairPLL<Long,T> pairPLL)
Assumes that a PLL is sorted by key and derives the appropriate partitioner for it.
- Type Parameters:
T- the type of values in the collection
- Parameters:
pairPLL- the PairPLL assumed to be sorted by key
-
withPartitioner
public PairPLL<K,V> withPartitioner(Optional<Partitioner<K>> partitioner)
Returns a copy of this PairPLL with a changed partitioner.
- Parameters:
partitioner- the partitioner to use in the returned copy
-
withCachedPartitionSizes
public PairPLL<K,V> withCachedPartitionSizes(io.vavr.collection.Array<Long> newCachedPartitionSizes)
Returns a copy of this PairPLL with the given partition sizes, when they are externally known.
- Overrides:
withCachedPartitionSizes in class PLL<Tuple2<K,V>>
-
getParents
public List<PLL<?>> getParents()
Returns directly the parents of the underlying PLL of this PairPLL, so that the PairPLL wrapper is transparent in query trees. This is because the PairPLL class is a bare wrapper on top of PLL for type-checking purposes, such as to hold a partitioner, without adding any computation on top of the underlying PLL.
- Specified by:
getParents in class PLL<Tuple2<K,V>>
- See Also:
PLL.getQueryTree()
-
getQueryTree
public QueryTree getQueryTree()
- Overrides:
getQueryTree in class PLL<Tuple2<K,V>>
- Returns:
- a tree-based representation of the dependencies of this PLL.
-
isCached
public boolean isCached()
Prevents double caching from this PairPLL wrapper and the underlying PLL by deferring the caching to the child.
-
cacheAsync
public ProgressingFuture<Void> cacheAsync()
Prevents double caching from this PairPLL wrapper and the underlying PLL by deferring the caching to the child.
- Overrides:
cacheAsync in class PLL<Tuple2<K,V>>
-
uncache
public void uncache()
Description copied from class: PLL
Unloads the partition contents from memory.
-
innerJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> innerJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the inner join of both PLLs. The resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
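Under these two assumptions (sorted inputs, unique keys), an inner join can be computed in a single pass with a classic merge join. The sketch below shows that technique on in-memory lists and ignores partitioning; it is not taken from the PairPLL source. `OrderedInnerJoin` is a hypothetical name and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class OrderedInnerJoin {
    /** Merge join of two lists sorted by key, each key appearing at most once per list. */
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> join(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right, Comparator<K> comparator) {
        List<Map.Entry<K, Map.Entry<V, W>>> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = comparator.compare(left.get(i).getKey(), right.get(j).getKey());
            if (cmp < 0) {
                i++; // left key has no partner on the right
            } else if (cmp > 0) {
                j++; // right key has no partner on the left
            } else {
                result.add(Map.entry(left.get(i).getKey(),
                        Map.entry(left.get(i).getValue(), right.get(j).getValue())));
                i++;
                j++;
            }
        }
        return result;
    }
}
```

The output is produced in key order, so the joined collection stays sorted without an extra sorting step.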
-
leftJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> leftJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the left join of both PLLs. The resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
rightJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> rightJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the right join of both PLLs. The resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
-
fullJoinOrdered
public <W> PairPLL<K,Tuple2<V,W>> fullJoinOrdered(PairPLL<K,W> other, Comparator<K> comparator)
Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the full (outer) join of both PLLs. The resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
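A full outer join over sorted, unique keys can also be done in one merge pass, emitting every key from either side. The sketch below assumes that a missing side is represented by null, which this page does not confirm for PairPLL. `OrderedFullJoin` and its `pair` helper are hypothetical, and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class OrderedFullJoin {
    /** Builds an entry that, unlike Map.entry, accepts null components. */
    static <A, B> Map.Entry<A, B> pair(A a, B b) {
        return new AbstractMap.SimpleImmutableEntry<>(a, b);
    }

    /** Full outer merge join; the value of an absent side is null (an assumption). */
    static <K, V, W> List<Map.Entry<K, Map.Entry<V, W>>> join(
            List<Map.Entry<K, V>> left, List<Map.Entry<K, W>> right, Comparator<K> comparator) {
        List<Map.Entry<K, Map.Entry<V, W>>> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() || j < right.size()) {
            int cmp;
            if (i == left.size()) {
                cmp = 1; // left exhausted: only right elements remain
            } else if (j == right.size()) {
                cmp = -1; // right exhausted: only left elements remain
            } else {
                cmp = comparator.compare(left.get(i).getKey(), right.get(j).getKey());
            }
            K key = (cmp <= 0) ? left.get(i).getKey() : right.get(j).getKey();
            V leftValue = (cmp <= 0) ? left.get(i++).getValue() : null;
            W rightValue = (cmp >= 0) ? right.get(j++).getValue() : null;
            result.add(pair(key, pair(leftValue, rightValue)));
        }
        return result;
    }
}
```

Restricting the emitted rows to those with a left value, or to those with a right value, gives the left and right joins respectively.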
-
-