Class PairPLL<K,​V>

  • Type Parameters:
    K - the type of keys in the collection
    V - the type of values in the collection

    public class PairPLL<K,​V>
    extends PLL<Tuple2<K,​V>>
    Adds additional methods for PLLs of keyed collections. The supplied Partitioner determines in which partition an element should be found, depending on its key.
    Author:
    Antonin Delpeuch
    • Method Detail

      • getPartitioner

        public Optional<Partitioner<K>> getPartitioner()
        Returns:
        the partitioner used to locate elements by key
      • hasCachedPartitionSizes

        public boolean hasCachedPartitionSizes()
        Description copied from class: PLL
        Is this PLL aware of the size of its partitions?
        Overrides:
        hasCachedPartitionSizes in class PLL<Tuple2<K,​V>>
      • keys

        public PLL<K> keys()
        Returns a PLL of the keys contained in this collection, in the same order.
        Returns:
      • values

        public PLL<V> values()
        Returns a PLL of the values contained in this collection, in the same order.
        Returns:
      • mapValues

        public <W> PairPLL<K,​W> mapValues​(BiFunction<K,​V,​W> mapFunction,
                                                String mapDescription)
        Returns a PLL obtained by mapping each element and preserving the indexing. If a partitioner is set on this PLL, it will also be used by the returned PLL.
        Type Parameters:
        W -
        Parameters:
        mapFunction - the function to apply on each element
        mapDescription - a short description of the function, for debugging purposes
        Returns:
      • get

        public io.vavr.collection.Array<V> get​(K key)
        Returns the list of elements of the PLL indexed by the given key. This operation will be more efficient if a partitioner is available, making it possible to scan the relevant partition only. Furthermore, if the PLL is known to be sorted (which happens when it bears a RangePartitioner), then we only scan the relevant partition up to the given key and do not scan the greater keys.
      • getRangeAfter

        public List<Tuple2<K,​V>> getRangeAfter​(K from,
                                                     int limit,
                                                     Comparator<K> comparator)
        Returns the list of elements starting at the given key and for up to the given number of elements. This assumes that the PLL is sorted by keys.
        Parameters:
        from - the first key to return
        limit - the maximum number of elements to return
        comparator - the ordering on K that is assumed on the PLL
        Returns:
      • getRangeBefore

        public List<Tuple2<K,​V>> getRangeBefore​(K upperBound,
                                                      int limit,
                                                      Comparator<K> comparator)
        Returns the list of elements ending at the given key (excluded) and for up to the given number of elements. This assumes that the PLL is sorted by keys.
        Parameters:
        upperBound - the least key not to return
        limit - the maximum number of elements to return
        comparator - the ordering on K that is assumed on the PLL
        Returns:
        TODO this could be optimized further, when the partitions are cached in memory, we can iterate from them in reverse
      • gatherElementsBefore

        protected static <K,​V> List<Tuple2<K,​V>> gatherElementsBefore​(K upperBound,
                                                                                  int limit,
                                                                                  CloseableIterator<Tuple2<K,​V>> stream,
                                                                                  Comparator<K> comparator)
        Returns the last n elements whose key is strictly less than the supplied upper bound.
        Parameters:
        stream - the stream to take the elements from, which is assumed to be in increasing order
      • getByKeys

        public io.vavr.collection.Array<Tuple2<K,​V>> getByKeys​(Set<K> keys)
        Returns the list of elements whose keys match one of the supplied keys.
        Parameters:
        keys - the keys to look up
        Returns:
        the list of elements in the order they appear in the PLL
      • streamFromKey

        public CloseableIterator<Tuple2<K,​V>> streamFromKey​(K from,
                                                                  Comparator<K> comparator)
        Iterates over the elements of this PLL starting from a given key (inclusive). This assumes that the PLL is sorted.
        Parameters:
        from - the first key to start iterating from
        comparator - the order used to compare the keys
        Returns:
        a streams which starts on the first element whose key is greater or equal to the provided one
      • streamUpToKey

        public CloseableIterator<Tuple2<K,​V>> streamUpToKey​(K upTo,
                                                                  Comparator<K> comparator)
        Iterates over the elements of this PLL up to the given key (exclusive). This assumes that the PLL is sorted.
        Parameters:
        upTo - the key to stop iterating at
        comparator - the order used to compare the keys
        Returns:
      • streamBetweenKeys

        public CloseableIterator<Tuple2<K,​V>> streamBetweenKeys​(K from,
                                                                      K upTo,
                                                                      Comparator<K> comparator)
        Iterates over the elements of this PLL from the given key (inclusive) and up to the other given key (exclusive). This assumes that the PLL is sorted.
        Parameters:
        from - the key to start iterating from
        upTo - the key to stop iterating at
        comparator - the order used to compare the keys
        Returns:
      • streamBetweenKeys

        public CloseableIterator<Tuple2<K,​V>> streamBetweenKeys​(Optional<K> from,
                                                                      Optional<K> upTo,
                                                                      Comparator<K> comparator)
        Iterates over the elements of this PLL between the given keys, where both boundaries are optional. This assumes that the PLL is sorted.
        Parameters:
        from -
        upTo -
        comparator -
        Returns:
      • filter

        public PairPLL<K,​V> filter​(Predicate<? super Tuple2<K,​V>> predicate)
        Description copied from class: PLL
        Derives a new PLL by filtering the collection to only contain elements which match the supplied predicate. The predicate is evaluated lazily, so it can be called multiple times on the same element, depending on the actions called on the returned PLL.
        Overrides:
        filter in class PLL<Tuple2<K,​V>>
        Returns:
      • limitPartitions

        public PairPLL<K,​V> limitPartitions​(long limit)
        Description copied from class: PLL
        Limit each partition to contain only their first N elements.
        Overrides:
        limitPartitions in class PLL<Tuple2<K,​V>>
        Parameters:
        limit - the maximum number of items per partition
        Returns:
      • retainPartitions

        public PairPLL<K,​V> retainPartitions​(List<Integer> partitionIndices)
        Description copied from class: PLL
        Only retain partitions designated by the given list of indices. The other partitions are dropped from the PLL. The partitions in the resulting PLL are ordered according to the list of indices supplied.
        Overrides:
        retainPartitions in class PLL<Tuple2<K,​V>>
        Parameters:
        partitionIndices - the indices of the partitions to retain
      • dropFirstElements

        public PairPLL<K,​V> dropFirstElements​(long n)
        Drops the first n elements at the beginning of the collection. This also adapts any partitioner to work on the cropped collection.
        Overrides:
        dropFirstElements in class PLL<Tuple2<K,​V>>
        Parameters:
        n - the number of elements to remove
        Returns:
      • dropLastElements

        public PairPLL<K,​V> dropLastElements​(long n)
        Drops the first n elements at the end of the collection. This also adapts any partitioner to work on the cropped collection.
        Overrides:
        dropLastElements in class PLL<Tuple2<K,​V>>
        Parameters:
        n - the number of elements to remove
        Returns:
      • compute

        protected CloseableIterator<Tuple2<K,​V>> compute​(Partition partition)
        Description copied from class: PLL
        Iterate over the elements of the given partition. This is the method that should be implemented by subclasses. As this method forces computation, ignoring any caching, consumers should not call it directly but rather use PLL.iterate(Partition). Once the iterator is not needed anymore, it should be closed. This makes it possible to release the underlying resources supporting it, such as open files or sockets.
        Specified by:
        compute in class PLL<Tuple2<K,​V>>
        Parameters:
        partition - the partition to iterate over
        Returns:
      • iterate

        public CloseableIterator<Tuple2<K,​V>> iterate​(Partition partition)
        Description copied from class: PLL
        Iterate over the elements of the given partition. If the contents of this PLL have been cached, this will iterate from the cache instead.
        Overrides:
        iterate in class PLL<Tuple2<K,​V>>
        Parameters:
        partition - the partition to iterate over
        Returns:
      • getPartitions

        public io.vavr.collection.Array<? extends Partition> getPartitions()
        Specified by:
        getPartitions in class PLL<Tuple2<K,​V>>
        Returns:
        the partitions in this list
      • toPLL

        public PLL<Tuple2<K,​V>> toPLL()
        Returns the underlying PLL, discarding any partitioner.
      • assumeIndexed

        public static <T> PairPLL<Long,​T> assumeIndexed​(PairPLL<Long,​T> pairPLL,
                                                              long totalRowCount)
        Assuming that the keys of the PairPLL are indices, deduce the partition sizes from the first element of each partition and the total number of elements, creating an appropriate RangePartitioner and adding it to the PLL.
        If the total row count is unknown (negative) then partition sizes are not inferred.
        Note: this method is static for type-checking purposes (it can only apply to a PairPLL keyed by Long).
        Parameters:
        pairPLL - the new PLL with the derived partitioner
        totalRowCount - the known row count
        Returns:
      • assumeSorted

        public static <T> PairPLL<Long,​T> assumeSorted​(PairPLL<Long,​T> pairPLL)
        Assumes that a PLL is sorted by key and derive the appropriate partitioner for it.
        Type Parameters:
        T -
        Parameters:
        pairPLL -
        Returns:
      • withPartitioner

        public PairPLL<K,​V> withPartitioner​(Optional<Partitioner<K>> partitioner)
        Returns a copy of this PairPLL with a changed partitioner.
        Parameters:
        partitioner -
        Returns:
      • withCachedPartitionSizes

        public PairPLL<K,​V> withCachedPartitionSizes​(io.vavr.collection.Array<Long> newCachedPartitionSizes)
        Returns a copy of this PairPLL with the given partition sizes, when they are externally known.
        Overrides:
        withCachedPartitionSizes in class PLL<Tuple2<K,​V>>
        Returns:
      • getParents

        public List<PLL<?>> getParents()
        Returns directly the parents of the underlying PLL of this PairPLL, so that the PairPLL wrapper is transparent in query trees. This is because the PairPLL class is a bare wrapper on top of PLL for type-checking purposes, such as to hold a partitioner, without adding any computation on top of the underlying PLL.
        Specified by:
        getParents in class PLL<Tuple2<K,​V>>
        See Also:
        PLL.getQueryTree()
      • getQueryTree

        public QueryTree getQueryTree()
        Overrides:
        getQueryTree in class PLL<Tuple2<K,​V>>
        Returns:
        a tree-based representation of the dependencies of this PLL.
      • isCached

        public boolean isCached()
        Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.
        Overrides:
        isCached in class PLL<Tuple2<K,​V>>
      • cacheAsync

        public ProgressingFuture<Void> cacheAsync()
        Prevent double caching from this PairPLL wrapper and the underlying PLL by defering the caching to the child.
        Overrides:
        cacheAsync in class PLL<Tuple2<K,​V>>
      • uncache

        public void uncache()
        Description copied from class: PLL
        Unloads the partition contents from memory
        Overrides:
        uncache in class PLL<Tuple2<K,​V>>
      • innerJoinOrdered

        public <W> PairPLL<K,​Tuple2<V,​W>> innerJoinOrdered​(PairPLL<K,​W> other,
                                                                       Comparator<K> comparator)
        Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the inner join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
      • leftJoinOrdered

        public <W> PairPLL<K,​Tuple2<V,​W>> leftJoinOrdered​(PairPLL<K,​W> other,
                                                                      Comparator<K> comparator)
        Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the left join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
      • rightJoinOrdered

        public <W> PairPLL<K,​Tuple2<V,​W>> rightJoinOrdered​(PairPLL<K,​W> other,
                                                                       Comparator<K> comparator)
        Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the right join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).
      • fullJoinOrdered

        public <W> PairPLL<K,​Tuple2<V,​W>> fullJoinOrdered​(PairPLL<K,​W> other,
                                                                      Comparator<K> comparator)
        Assuming both PairPLLs are ordered by key, and each key appears at most once in each dataset, returns an ordered PairPLL with the full (outer) join of both PLLs. This resulting PLL is partitioned with the same partitioner as the left PLL (the instance on which this method is called).